Large scale Data Lakes and Fast Warehousing

By Jaideep Khandelwal

Elevator Pitch

The talk is a tale of the solutions used to set up large-scale Data Lakes and Fast Warehousing, deployed with a state public transportation department in India. It covers the challenges encountered, why a particular solution was picked, and the essential parts of the solutions.

Description

Business Problem

How would you go about creating a resilient and reliable source of truth when over 50 million data points are generated every day? The velocity and volume of data had grown to a level where a general query could take anywhere between 2 and 24 hours to respond.

This was a recent problem we worked on for state transport buses in India. IoT-enabled devices installed in the buses generated huge volumes of event and spatial data. This data had to be ingested properly, stored, and analyzed for different users.

Topics covered

The proposed solution included setting up a Data Warehouse, creating Data Marts for consumers, implementing ETL algorithms, and setting up OLAP cubes that allow for ‘slicing and dicing’ within permissible time limits.

The proposed talk would cover why such a solution was chosen, the implementation process, the challenges faced along the way, the tools used, and how the final results added up. We’ll have a quick comparison of the different tools available and why some were chosen over others. We’ll also touch upon various concepts of large-scale Data Warehouse systems, like:

  • De-duplication of data.
  • Tackling Slowly Changing Dimensions (SCD).
  • Snowflake to Star Schema.
  • Idempotence.
  • Discrete and rolling Rate Limitations.
  • Ordered Delivery.
  • Distributed Wait Groups.
  • Batching a Stream with time or spatial thresholds.
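
As a taste of two of these concepts, de-duplication and idempotence, here is a minimal sketch in Python. It is not the implementation deployed in the project: the field names `device_id` and `ts`, the function `load_batch`, and the in-memory dictionary standing in for a warehouse table are all assumptions made for illustration. The idea is that each event gets a natural key, so replaying the same batch leaves the store unchanged.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical event shape: {"device_id": str, "ts": int, ...payload}
Event = Dict[str, object]
Key = Tuple[object, object]


def event_key(event: Event) -> Key:
    """Natural key assumed to identify an event uniquely (illustrative choice)."""
    return (event["device_id"], event["ts"])


def load_batch(store: Dict[Key, Event], batch: Iterable[Event]) -> int:
    """Insert events into the store, skipping keys that are already present.

    Returns the number of newly inserted events. Running the same batch
    twice inserts nothing the second time, so the load is idempotent.
    """
    inserted = 0
    for event in batch:
        key = event_key(event)
        if key not in store:
            store[key] = event
            inserted += 1
    return inserted


if __name__ == "__main__":
    store: Dict[Key, Event] = {}
    batch = [
        {"device_id": "bus-1", "ts": 1000, "lat": 12.97, "lon": 77.59},
        {"device_id": "bus-1", "ts": 1000, "lat": 12.97, "lon": 77.59},  # duplicate
        {"device_id": "bus-2", "ts": 1000, "lat": 13.08, "lon": 80.27},
    ]
    print(load_batch(store, batch))  # first load
    print(load_batch(store, batch))  # replay of the same batch
```

In a real warehouse the same effect would more likely come from a unique constraint or a merge/upsert on the key than from an in-memory check; the sketch only shows the property being aimed for.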

Who Should Attend

This talk would be useful for anyone involved with data and analytics systems, including Product Managers, Data Engineers, Software Architects, and Engineers, as well as people working in Business Intelligence and Analytics teams.

Notes

With a background in Distributed Systems, I have recently moved to big data processing. This talk is a case study of a solution that was implemented for the business problem described above. Drawing on my hands-on experience in plumbing and setting up data pipelines to create Data Warehouses, I will be able to share what I learned with people who are doing data engineering or want to set up data pipelines in their organizations.