Don't use Spark, no really, please

By Elad Amit

Elevator Pitch

Over the past half decade or so, the hype around Spark Streaming has grown to such an extent that it is hard to tell myth from reality.
Our experience with it in production has been such a disappointment that we would like to share our findings and explain when we think it should be avoided.

Description

1/2 minute - introduction
1.5 minutes - the spark promise
2.5 minutes - the reasons we found it unfit for our production use case, which people don’t blog about :(
- the instability of the Kafka direct stream vs. the robustness of the receiver-based Kafka stream
- the coupling of application state (i.e. what has to survive failure) with operational state (e.g. the jar, receiver execution locality)
- autoscaling: good luck (i.e. how long it takes the Spark-on-YARN cluster to add a node, and how the input stream handles it)
- how I’ve come to loathe all-in-one solutions (e.g. a shuffle service that takes 15 minutes to restart should be replaceable)
1/2 minute - recap