Elevator Pitch
Developing ETL workflows is a minefield, from schema drift to performance bottlenecks. In this talk, we’ll discuss how Temporal’s durable execution primitives enable developers to build reliable, testable ETL pipelines.
Description
Building reliable ETL pipelines is a persistent challenge in modern data systems. While the premise seems straightforward (extract, transform, and load data), the reality is far more complex. Data engineers frequently grapple with schema changes that break pipelines, performance bottlenecks caused by massive data volumes, and upstream data sources that fail in subtle ways.
Traditional ETL solutions force teams to hand-roll sophisticated retry mechanisms and durability primitives, duplicating the same work in every pipeline. This leads to fragile systems that are expensive to maintain and difficult to evolve. These are precisely the gaps that make Temporal an excellent choice for building ETL pipelines.
In this talk, we’ll explore how Temporal’s durable execution framework provides elegant solutions to challenges that are specific to ETL pipelines. We’ll dive deep into how features like activity heartbeats and child workflows make data processing reliable at scale, as in the fan-out sketch below.
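To give a flavor of the patterns we’ll cover, here is a minimal sketch in Python of the child-workflow fan-out; the workflow names are hypothetical stand-ins, not the final demo code. Each partition gets its own event history, so one huge table never bloats the parent workflow:

```python
from temporalio import workflow

# Hypothetical workflows for illustration; the demo uses a concrete pipeline.
@workflow.defn
class PartitionEtlWorkflow:
    @workflow.run
    async def run(self, partition: str) -> None:
        ...  # extract/transform/load activities for a single partition

@workflow.defn
class EtlPipelineWorkflow:
    @workflow.run
    async def run(self, partitions: list[str]) -> None:
        # Fan out one child workflow per partition; each child retries
        # and recovers independently of its siblings.
        handles = [
            await workflow.start_child_workflow(
                PartitionEtlWorkflow.run,
                partition,
                id=f"etl-partition-{partition}",
            )
            for partition in partitions
        ]
        # Wait for every partition to finish.
        for handle in handles:
            await handle
```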
We’ll discuss managing schema drift, navigating resource constraints at high data volumes, robust error handling and retries, setting up telemetry and observability, maintaining and evolving pipeline logic, and end-to-end testing strategies with synthetic data in CI/CD. This will be paired with a live demo of an example Temporal workflow that shows all of these patterns working together.
The outline for the live demo is as follows:
- Running an ETL Workflow:
  - Demonstrate data fetching and processing.
  - Simulate failure scenarios and recovery using Temporal’s durable state management.
- Testing the Workflow:
  - Setting up unit tests for activities using the Temporal SDK (sketched after this outline).
  - End-to-end testing on GitHub Actions using the Temporal CLI.
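The activity-level tests could look roughly like this sketch, which uses the Python SDK’s ActivityEnvironment; the transform_records activity is a hypothetical stand-in:

```python
import pytest
from temporalio import activity
from temporalio.testing import ActivityEnvironment

# Hypothetical activity used only for illustration.
@activity.defn
async def transform_records(records: list[dict]) -> list[dict]:
    return [{**r, "normalized": True} for r in records]

@pytest.mark.asyncio  # requires the pytest-asyncio plugin
async def test_transform_records():
    env = ActivityEnvironment()
    # ActivityEnvironment runs the function with a real activity context,
    # so calls like activity.heartbeat() inside it behave correctly.
    result = await env.run(transform_records, [{"id": 1}])
    assert result == [{"id": 1, "normalized": True}]
```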
We’ll end the talk with a discussion on how Temporal’s testing framework enables comprehensive validation, from unit tests to end-to-end scenarios, all integrated seamlessly with modern CI/CD practices.
Benefits to the Ecosystem
- Temporal is a leading platform for building durable execution workflows, which makes it a natural choice for ETL pipelines. This talk proposes a structured set of patterns for designing ETL pipelines with Temporal.
- Attendees will learn how to leverage durable execution primitives to build ETL systems that are both resilient to failures and maintainable as they evolve.
- Attendees will also gain practical skills for testing and debugging workflows efficiently with the Temporal CLI and GitHub Actions.
We hope this talk becomes a go-to resource in the community for solving real-world problems around building ETL workflows with Temporal.
Audience
- This talk is suitable for backend engineers, data engineers, and engineering leaders who build workflows for data processing, especially teams that want to move beyond fragile, traditional ETL solutions to robust data pipelines.
- The ideal audience is mid-level to senior backend engineers who work on ETL workflows, data pipelines, or other durable task orchestration challenges. Engineers familiar with orchestration tools or workflow engines but new to Temporal will find immense value in this talk.
Notes
Technical Requirements
- Familiarity with databases.
- Familiarity with GitHub Actions or any other CI/CD system.
ETL Challenges
The following is a list of the ETL challenges we plan to discuss, along with their solutions using Temporal. Code sketches for these patterns follow the list.
- Schema Drift and Data Quality - Using Temporal’s signal and query handlers to manage and communicate evolving schemas within the workflow (see the signal and query sketch below).
- Managing Concurrency to Prevent Abuse - Using Temporal’s worker concurrency configurations to optimize resource utilization and prevent abuse (see the worker configuration sketch below).
- Error Handling and Retries - Using Temporal’s retry policies, which can be tuned per failure scenario, together with the heartbeat mechanism to keep long-running activities healthy and detect stuck or failed processes in real time (see the retry and heartbeat sketch below).
- Observability Challenges - Temporal’s out-of-the-box OpenTelemetry (OTEL) support makes it easy to observe workflows and build alerting flows (see the tracing sketch below).
- Unit and E2E Testing - Temporal’s testing frameworks support multiple levels of validation: unit tests for individual workflow components, E2E tests of complete workflows, and full integration suites run in CI/CD (see the time-skipping test sketch below).
- Security and PII Data - Temporal’s secret management and encryption capabilities ensure that sensitive data, such as credentials and PII, is handled securely (see the payload codec sketch below).
- Maintenance and Evolution - Temporal’s polyglot SDK support and code-first approach make it easier to version pipeline logic and give developers a mature, code-centric workflow.
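The sketches below illustrate these patterns with the Temporal Python SDK; all workflow, activity, module, and helper names are hypothetical stand-ins rather than the final demo code. First, schema drift handled via a signal that pushes a new schema version and a query that reports the one currently in use:

```python
from temporalio import workflow

@workflow.defn
class SchemaAwareEtlWorkflow:
    def __init__(self) -> None:
        self.schema_version = "v1"

    @workflow.signal
    def update_schema(self, version: str) -> None:
        # An operator or an upstream schema detector signals a drift;
        # subsequent activity invocations pick up the new version.
        self.schema_version = version

    @workflow.query
    def current_schema(self) -> str:
        # Queries let dashboards and the CLI inspect the live value.
        return self.schema_version

    @workflow.run
    async def run(self) -> None:
        ...  # ETL loop that passes self.schema_version to its activities
```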
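Next, concurrency control at the worker level; the limits shown and the my_pipeline module are illustrative:

```python
from temporalio.client import Client
from temporalio.worker import Worker

from my_pipeline import EtlWorkflow, extract, transform, load  # hypothetical

async def run_worker() -> None:
    client = await Client.connect("localhost:7233")
    worker = Worker(
        client,
        task_queue="etl-task-queue",
        workflows=[EtlWorkflow],
        activities=[extract, transform, load],
        # Cap in-flight activities so a burst of partitions cannot
        # exhaust the worker host or hammer the upstream sources.
        max_concurrent_activities=20,
        # Bound how many workflow tasks this worker handles at once.
        max_concurrent_workflow_tasks=50,
    )
    await worker.run()
```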
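For error handling, a retry policy on the activity invocation paired with heartbeats inside a long-running extract; SchemaValidationError is a hypothetical non-retryable error type:

```python
from datetime import timedelta
from temporalio import activity, workflow
from temporalio.common import RetryPolicy

@activity.defn
async def extract_rows(source: str) -> int:
    total = 0
    for page in range(100):  # stand-in for paging through a data source
        total += page
        # Heartbeat so the server can detect a stuck extract and
        # reschedule it on another worker if this one dies mid-run.
        activity.heartbeat(page)
    return total

@workflow.defn
class ExtractWorkflow:
    @workflow.run
    async def run(self, source: str) -> int:
        return await workflow.execute_activity(
            extract_rows,
            source,
            start_to_close_timeout=timedelta(hours=1),
            heartbeat_timeout=timedelta(minutes=2),
            retry_policy=RetryPolicy(
                initial_interval=timedelta(seconds=5),
                backoff_coefficient=2.0,
                maximum_attempts=5,
                # Hypothetical error type: do not retry on bad schemas.
                non_retryable_error_types=["SchemaValidationError"],
            ),
        )
```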
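The observability wiring is small in the Python SDK; the contrib tracing interceptor emits OpenTelemetry spans for workflow and activity execution:

```python
from temporalio.client import Client
from temporalio.contrib.opentelemetry import TracingInterceptor

async def connect_traced_client() -> Client:
    # Spans are exported through whatever OTEL SDK pipeline the
    # process has configured (collector endpoint, sampling, etc.).
    return await Client.connect(
        "localhost:7233",
        interceptors=[TracingInterceptor()],
    )
```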
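For E2E tests, the time-skipping test environment fast-forwards timers and retry backoffs so a long pipeline finishes in seconds; EtlWorkflow and its activities are again hypothetical:

```python
from temporalio.testing import WorkflowEnvironment
from temporalio.worker import Worker

from my_pipeline import EtlWorkflow, extract, transform, load  # hypothetical

async def test_pipeline_end_to_end() -> None:
    async with await WorkflowEnvironment.start_time_skipping() as env:
        async with Worker(
            env.client,
            task_queue="etl-test-queue",
            workflows=[EtlWorkflow],
            activities=[extract, transform, load],
        ):
            result = await env.client.execute_workflow(
                EtlWorkflow.run,
                "synthetic-dataset",
                id="etl-e2e-test",
                task_queue="etl-test-queue",
            )
            assert result is not None
```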
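And for PII, a payload codec that encrypts payloads before they ever reach the Temporal server; encrypt_bytes and decrypt_bytes stand in for a real KMS-backed cipher:

```python
from typing import Iterable, List

from temporalio.api.common.v1 import Payload
from temporalio.converter import PayloadCodec

from my_crypto import encrypt_bytes, decrypt_bytes  # hypothetical cipher helpers

class EncryptionCodec(PayloadCodec):
    async def encode(self, payloads: Iterable[Payload]) -> List[Payload]:
        # Wrap every outgoing payload so credentials and PII are
        # encrypted at rest on the Temporal server.
        return [
            Payload(
                metadata={"encoding": b"binary/encrypted"},
                data=encrypt_bytes(p.SerializeToString()),
            )
            for p in payloads
        ]

    async def decode(self, payloads: Iterable[Payload]) -> List[Payload]:
        # Sketch: assumes every payload was produced by encode() above.
        return [Payload.FromString(decrypt_bytes(p.data)) for p in payloads]
```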
Speaker Experience
- Junaid has previously presented at ArgoCon Europe and DuckCon North America. He is currently part of the team responsible for building pipelines that pull metadata from all major databases, data warehouses, and BI tools in the modern data stack. He was an early engineer at Atlan, where he helped implement and scale the ETL platform.
- Nishchith has previously presented at ArgoCon North America. He is currently working with the Platform team building the next generation of the ETL platform. He was one of the early engineers at Atlan and helped implement and scale the federated query engine using Apache Calcite.
For context, Atlan is a control plane that helps the humans of data discover, trust, and collaborate on data assets by bringing in metadata from various sources. Just as GitHub is for engineering teams and Figma is for design teams, Atlan is for data teams to collaborate.