Fast and furiously safe: type-safe programming with Spark DataFrames

By Alfonso Roa

Elevator Pitch

Apache Spark poses a major dilemma to programmers: if you prioritize performance, DataFrames are the option; if you strive for modularity, you must sacrifice some performance and stick to Datasets. This is too bad … and unnecessary! How can we achieve both with a better DSL for DataFrames?

Description

Spark, the ever-popular big-data programming framework, poses a major dilemma to programmers: if you prioritize performance, DataFrames are the option; if you strive for modularity, you must sacrifice some performance and stick to Datasets. This is too bad … and unnecessary! We present doric, a new open-source Scala library that brings type safety to DataFrame column expressions at minimal cost, without compromising performance, and with minimal changes to your accustomed syntax. The talk walks the listener through common problems encountered with Spark DataFrames and shows how doric helps us to overcome them:

- Run DataFrames only when it is safe to do so. Spark detects references to non-existing columns and complains with an exception at runtime. However, it cannot detect references to existing columns of the wrong type, since type expectations are not encoded in plain columns. With doric, column expressions are typed, so an ill-typed reference prevents the DataFrame from being created in the first place (first sketch below).

- Get rid of malformed column expressions at compile time. Spark complains if we try to add integers to booleans, but it complains too late, with an exception raised at runtime. With doric, there is no need to wait that long: such errors are reported at compile time (second sketch below).

- Get all errors at once. Unlike Spark, doric does not stop at the first error: it keeps accumulating problems until no further one is found, with a hint to the exact location of each error (third sketch below).

- Modularize your business logic. Get rid of boilerplate code and copy-paste! Doric's combination of error location and type safety is the key enabler of a principled approach to writing libraries of column expressions and enhancing the modularity of our DataFrames (last sketch below).
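
To make these points concrete, here are some minimal sketches in Scala. They follow the typed column selectors shown in the doric README (colInt, colString and friends), but exact names and signatures should be taken as indicative of the API rather than definitive. First, type-checked column references, validated before any Spark job runs (assuming a SparkSession named `spark` is in scope):

    import doric._
    import spark.implicits._ // assumes a SparkSession named `spark` in scope

    val df = List(("hi", 1), ("bye", 2)).toDF("msg", "n")

    // Plain Spark: f.col("msg") carries no type information, so treating it
    // as an integer only fails later, at runtime, deep inside the job.
    // doric: the expected type is part of the reference itself, so the
    // mismatch below is reported as soon as the column is applied to the
    // DataFrame, before any job is launched.
    df.select(colInt("msg")) // error up front: "msg" holds strings, not ints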
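
Second, ill-typed expressions rejected by the compiler rather than at runtime:

    import doric._

    // Spark: f.col("n") + f.col("flag") compiles happily and only fails at
    // runtime once `flag` is found to be a Boolean column.
    // doric: DoricColumn[Int] and DoricColumn[Boolean] cannot be added, so
    // the line below simply does not compile.
    // colInt("n") + colBoolean("flag") // compile-time type mismatch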
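
Third, error accumulation; the exact rendering of the aggregated error may vary across doric versions:

    import doric._

    // Both references are wrong; Spark would stop at the first problem.
    // doric collects every error and reports them together, each one
    // pointing at the source file and line where the column was declared.
    df.select(
      colInt("msg"),       // wrong type: "msg" holds strings
      colString("missing") // wrong name: no such column
    )
    // => a single aggregated doric error listing both problems, with locations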
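
Finally, modularity: since a DoricColumn[T] is an ordinary typed Scala value, business logic can be factored into reusable, composable functions. withTax below is a hypothetical helper of our own, not part of doric:

    import doric._
    import spark.implicits._ // assumes a SparkSession named `spark` in scope

    val prices = List(10.0, 25.5).toDF("price")

    // A reusable, typed business rule: apply a tax rate to a price column.
    def withTax(price: DoricColumn[Double], rate: Double): DoricColumn[Double] =
      price + price * rate.lit

    // The same rule works on any DataFrame with a Double `price` column;
    // applying it to a column of the wrong type is caught before the job runs.
    val taxed = prices.withColumn("total", withTax(colDouble("price"), 0.21))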

Notes

The talk is about a DSL we are developing as an open-source project that improves the DataFrame API by introducing column types, making it easier to use, and more readable, modular, and maintainable. The talk requires basic knowledge of the Spark API to appreciate the differences, but no expertise in big data or the internals of Spark. I am the main developer of the project, and I will focus the talk on why type safety is a must when developing complex ETLs, comparing plain Spark with the doric API.