Goodbye cron, hello Airflow

By Yuriy Senko

Elevator Pitch

My talk is about our experience replacing a legacy ETL system written in bash, Python, and cron with a new system built on top of Apache Airflow. I want to concentrate on:

  • typical Airflow use cases
  • design, implementation, and testing of Airflow pipelines
  • CI/CD and Kubernetes deployment

Description

Airflow is a workflow automation and scheduling system that can be used to author and manage data pipelines. My talk is about replacing a legacy data processing system with Apache Airflow. I’m going to share practical experience of designing Airflow pipelines and running them in a Kubernetes cluster, including:

  • typical Airflow use cases
  • designing, implementing, and testing Airflow pipelines
  • extending Airflow with your own tasks
  • running Airflow in Kubernetes
  • monitoring an Airflow deployment

Notes

Here is a preliminary outline of my talk:

  • Description of the problem we faced (a legacy project with tons of bash scripts, Python services, and cron jobs; no monitoring or alerts; support engineers have root access to prod and fix problems by running bash commands directly on it).
  • What is wrong with bash, cron, and tons of Python services for building data processing pipelines.
  • Why Apache Airflow (comparison with other solutions, and Airflow’s main benefits and drawbacks).
  • How to design Airflow pipelines (DAGs):
    • Typical use cases
    • DAGs - design, test, and release
    • How (why and when) to extend Airflow with your own tasks
    • Scheduling
  • Airflow and Kubernetes - a perfect match.
    • CI/CD pipelines
    • Monitoring and alerts

I’m also going to do at least one dry run of my presentation inside my company before the event and adjust the content, timing, etc. if needed.