New in DataStreams.jl: Type flexibility, querying, and parallelism, oh my!

By Jacob Quinn

Elevator Pitch

What’s new in the data framework package powering some of the most popular data packages in Julia? Come learn about advancements in flexibly typed schemas, querying functionality directly integrated with IO, and automatic data transfer parallelism.

Description

The DataStreams.jl framework is behind a number of key packages in the Julia data ecosystem. At its core, it defines the “Source” and “Sink” interfaces; any format that implements them automatically integrates with every other format that implements them too. This solves the one-to-many interop problem that always plagues data formats (“what? it only takes CSV files??”). With DataStreams, it’s quick and easy to implement the interface and automatically hook into the rest of the Julia data ecosystem.
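To make the Source/Sink idea concrete, here is a minimal sketch in plain Julia. It does not use the actual DataStreams.jl API: every name in it (`AbstractSource`, `AbstractSink`, `colnames`, `nrows`, `getvalue`, `putvalue!`, `transfer!`, `VectorSource`, `ColumnSink`) is hypothetical and chosen only for illustration. The point is that the transfer function is written once against the two interfaces, so each new format only needs to implement its own side.

```julia
# Hypothetical, simplified interfaces; these are NOT the DataStreams.jl API.
abstract type AbstractSource end
abstract type AbstractSink end

# A toy source: rows held in memory, one vector of values per row.
struct VectorSource <: AbstractSource
    names::Vector{Symbol}
    rows::Vector{Vector{Any}}
end

# A toy sink: collects each column into its own vector.
struct ColumnSink <: AbstractSink
    columns::Dict{Symbol, Vector{Any}}
end
ColumnSink() = ColumnSink(Dict{Symbol, Vector{Any}}())

# What a source must provide in this sketch.
colnames(src::VectorSource) = src.names
nrows(src::VectorSource) = length(src.rows)
getvalue(src::VectorSource, row::Int, col::Int) = src.rows[row][col]

# What a sink must provide in this sketch.
function putvalue!(sink::ColumnSink, name::Symbol, value)
    push!(get!(sink.columns, name, Any[]), value)
    return sink
end

# Generic transfer, written once against the two interfaces.
function transfer!(sink::AbstractSink, src::AbstractSource)
    names = colnames(src)
    for row in 1:nrows(src), col in eachindex(names)
        putvalue!(sink, names[col], getvalue(src, row, col))
    end
    return sink
end

src = VectorSource([:id, :name], [Any[1, "a"], Any[2, "b"]])
sink = transfer!(ColumnSink(), src)
```

Under these assumptions, adding a new format means defining three source methods or one sink method; `transfer!` itself never changes.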

So what’s new and noteworthy in DataStreams?

  • Flexibly typed schemas: a long-standing issue with any sort of data transfer is how to align expected types between source and sink. Base Julia itself has pioneered a flexible yet performant solution in its implementation of map over collections. The same approach has been applied to DataStreams to allow dynamic, type-inference-independent transfer from sources to sinks.
  • IO-integrated querying functionality: how many times have you thought, “sheesh, I wish there was a way to only parse a few columns, filter out certain values, and apply a transformation to this CSV file all at the same time!” Ok, maybe not those words exactly, but with DataStreams, now you can! A fully integrated query planner can now take any number of transformations, apply them at the IO level so that no more data is transferred than absolutely necessary, and fuse them all together into tightly compiled Julia code.
  • Data transfer parallelism: Sinks can now signal to Sources that they support parallel streaming; this can lead to massive improvements in data throughput and makes full use of a system’s resources.
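As a rough illustration of the flexibly typed schema idea in the first bullet, the sketch below collects values into a column whose element type starts as narrow as possible and is widened only when a value of a new type shows up. It is written in the spirit of how Base widens result types when collecting, but it is a simplification: the names `widening_collect` and `fill_column!` are hypothetical, nothing here is DataStreams.jl code, and the widening rule (a plain `Union` of the old element type and the new value's type) is an assumption made for brevity.

```julia
# Hypothetical helpers, not part of DataStreams.jl or Base.
# Start with a column typed after the first value, then widen on demand.
function widening_collect(itr)
    y = iterate(itr)
    y === nothing && return Any[]      # nothing to infer a type from
    first_value, state = y
    column = [first_value]             # Vector{typeof(first_value)}
    return fill_column!(column, itr, state)
end

function fill_column!(column::Vector{T}, itr, state) where {T}
    y = iterate(itr, state)
    while y !== nothing
        value, state = y
        if value isa T
            push!(column, value)       # fast path: type already fits
        else
            # Assumed widening rule: union of old eltype and the new type.
            S = Union{T, typeof(value)}
            wider = Vector{S}(column)  # copy existing values into wider column
            push!(wider, value)
            # Continue with the wider column; dispatch re-specializes on S.
            return fill_column!(wider, itr, state)
        end
        y = iterate(itr, state)
    end
    return column
end

col = widening_collect([1, 2, missing, 4])
ints = widening_collect((1, 2, 3))
```

With this approach the sink does not need the source's column types up front: a column that turns out to be all integers stays a `Vector{Int}`, while one that meets a `missing` partway through is widened to `Vector{Union{Int, Missing}}`.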

Notes

Primary author of DataStreams.jl. This talk would be a bit of a peek under the hood, since most users don’t interact directly with DataStreams but rather with the data format packages themselves. It’s most useful for users hoping to better understand the DataStreams interfaces and how to fully leverage all the available functionality. Could also be a 30-minute talk, where I could go more in-depth into some of the interface changes and possibly even walk through a full implementation.