Elevator Pitch
Programming with Spark’s DataFrames in Scala is like driving a last-generation car with all its shiny safety features disabled. Leave the Pythonista style behind and learn how to leverage the Scala type system in Spark with doric. On-demand type safety and modularity with DataFrames, at no cost!
Description
Every data engineer struggles daily against a host of hazards on the way to deploying new pipelines: unannounced schema changes, single source files thousands of lines long, huge stack traces, and so on. This talk shows how to avoid these problems by leveraging the Scala type system with doric, an open-source library that adds a thin layer of static typing to Spark DataFrames. Through common, practical use cases, we will see how doric opens the door to modularity, reuse, testability, and more when the time comes to write complex yet optimal programs with Spark DataFrames: something you will hardly be able to do with PySpark!
Notes
Doric is an open-source library that is 100% compatible with any Spark cluster running Spark 2.4 or newer. It has a growing number of stars on GitHub and is starting to be used in production in several places. We are really happy to see the team of contributors growing as well. There are features we are working on right now to improve its capabilities, but all in all, in its current state, we believe it makes a strong case for Scala when using Spark DataFrames.
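To give reviewers a taste of what the talk will demonstrate, here is a minimal sketch of the kind of type safety doric adds on top of plain Spark columns. This is an illustrative sketch based on doric's documented column syntax (names such as colInt and the .lit extension are taken from its README and may vary slightly between versions):

```scala
import org.apache.spark.sql.SparkSession
import doric._

object DoricSketch extends App {
  val spark = SparkSession.builder().master("local[*]").getOrCreate()
  import spark.implicits._

  val df = List(1, 2, 3).toDF("value")

  // With plain Spark columns, an ill-typed expression such as
  //   df.select(org.apache.spark.sql.functions.col("value") + true)
  // only blows up at runtime with a lengthy AnalysisException.
  // With doric, columns carry their type: the expression below compiles
  // only because Int + Int is well typed, and a mismatch with the actual
  // schema is reported with the precise source location of the error.
  df.select(colInt("value") + 1.lit).show()
}
```

The talk walks through use cases like this one, comparing the error messages and the modularity of the resulting code against the vanilla column API.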
We have already spoken about doric at conferences such as Scala Love and ScalaCon, as well as at Scala meetup groups in Spain and Belgium. This time we will focus more on practical use cases and less on the inner details of the library, so as to provide practical advice and a comparison with alternatives that fall short, such as Datasets.