Reducing Accretion In Monoliths With Stream Processing

By Akshay Gupta

Elevator Pitch

In this talk, we’ll walk through our journey of migrating critical payment flows out of the monolithic order management system (OMS) for ride-hailing services at GO-JEK and into a Clojure framework that composes event streaming, reliable task scheduling and configuration management.

Description

Intro [5 min]


What is up?

Over the years, the GO-JEK Ride-hailing Order Management System (OMS) has become a large monolith that is notorious for its uptime and maintainability issues. The dependency matrix is large and continues to grow. Releasing features is becoming harder, testing and QA cycles are long and, consequently, deploy cycles are time-consuming.

What is this about?

This talk is about our attempts to deal with this problem. We’ve built a framework in Clojure that provides a ready-made mechanism for quickly writing asynchronous task runners. Their backbone is GO-JEK’s already large and battle-hardened Kafka infrastructure. Instead of growing the monolith vertically, we’ve extracted actors for the various essential but non-synchronous tasks. This extraction is ongoing, and the talk covers our journey of rethinking our existing flows, migrating our Sidekiq jobs to these actors in Clojure, and how we hope to see this process through to its logical end.

Who are we?

We’re a part of the Rider Platform team. We manage the OMS and various auxiliary services that deal with creating and completing orders for all the Transport services at GO-JEK across Southeast Asia (bike, trike, car and taxis).

Flashback to one year ago [5 min]


  • Still early days for the OMS, but signs of accretion already. It’s been just a year since we pulled ourselves out of another, even bigger, harder-to-maintain monolith that handled basically everything for the entirety of GO-JEK. A year on, it smells like our little Transport OMS extraction is not going to go well either.
  • A huge dependency matrix that appears inevitable on paper.
  • One of the more critical flows, payment processing, goes through 4 different sub-systems, 2 of which are now grossly unmaintained. Making changes is almost a sprint-long effort.

Cometh the hour… Cometh the Actor [10 min]


  • One of our early forays into Clojure was to pull all app notifications out of the OMS and into a service that reads messages from Kafka, parses them and dispatches the appropriate notification to the notification service.
  • Thanks to the JVM ecosystem, hooking up to Kafka Streams was painless.
  • Multimethod dispatch came in quite handy to organize the code in a repeatable, data-oriented way.
  • The service was successful and still runs with little to no infrastructural upgrade. We all felt there was an opportunity to do the same for the various asynchronous tasks in the OMS, if there were a simple, repeatable way to make services that, in a nutshell:
    1. Read from Kafka
    2. Do some work
    3. If the work fails, retry, log and continue
    4. Monitor and provision infrastructure appropriately
  • Convinced, we formed a team to build a framework that can do all this. The result is what we call Ziggurat.
  • Our payments processing flow, one of the more critical parts of our system, was the primary candidate to migrate into a Ziggurat actor.
  • The migration has been a success: it has been running smoothly for the past 4 months and processes over 3 million bookings a day.
  • We’re now starting to realize the power of this model and how it can further help in reducing accretion in big systems like ours.
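The four-step shape listed above (read, do some work, retry and log on failure, continue) can be sketched in a few lines. This is a minimal illustration in Python and not Ziggurat's API: the `handler` decorator, the `run_actor` function, the `max_attempts` parameter and the in-memory list standing in for a Kafka topic are all assumptions made for the example. The handler registry keyed on message type is loosely analogous to dispatching a Clojure multimethod on a key.

```python
import logging
from typing import Callable, Dict, Iterable, List

logger = logging.getLogger("actor_sketch")

# Hypothetical registry: message "type" -> handler function.
HANDLERS: Dict[str, Callable[[dict], None]] = {}


def handler(message_type: str):
    """Register a function as the handler for one message type."""
    def register(fn: Callable[[dict], None]) -> Callable[[dict], None]:
        HANDLERS[message_type] = fn
        return fn
    return register


def run_actor(messages: Iterable[dict], max_attempts: int = 3) -> List[dict]:
    """Process each message; on failure retry, log, and move on.

    Returns the messages that still failed after all attempts, as a
    stand-in for a dead-letter queue that can be inspected later.
    """
    dead_letters: List[dict] = []
    for message in messages:  # 1. read (the iterable stands in for Kafka)
        fn = HANDLERS.get(message.get("type"))
        if fn is None:
            logger.warning("no handler for %r, skipping", message)
            continue
        for attempt in range(1, max_attempts + 1):
            try:
                fn(message)  # 2. do some work
                break
            except Exception:
                # 3. log the failure; the loop retries until attempts run out
                logger.exception(
                    "attempt %d/%d failed for %r", attempt, max_attempts, message
                )
        else:
            # Every attempt failed: park the message and continue.
            dead_letters.append(message)
    return dead_letters
```

A real actor would add a delay between attempts and persist failed messages durably; both are left out here to keep the control flow visible.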

Payments has a new home and other assorted takeaways [15 min]


  • Show some of the internals of how the Payments Processor and Ziggurat actors work: the framework patterns, retry mechanisms, auditing etc.
  • Show how payments can be audited and how we can get timely alarms for when even a single payment is in trouble.
  • Compare the previous system with how things are now. See how much this extraction objectively simplified things, and how much more it could if the process were repeated for similar problems that currently exist in the OMS.
  • Limitations of actors: indirection in the call path, no ready access to the source of truth for the data, performing ad-hoc state changes, keeping things idempotent in the asynchronous world.
  • Did Clojure really help here?
  • Is this not-really-a-framework, but maybe-a-library ever going to be open-sourced?
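The idempotency concern in the limitations above comes from retries and redelivery: the same message may reach a handler more than once, so handling it twice must have the same effect as handling it once. A minimal sketch in Python, assuming each payment message carries a unique order id; the `PaymentLedger` class and its methods are hypothetical and only illustrate the idea.

```python
from typing import Dict


class PaymentLedger:
    """Hypothetical in-memory ledger; a real one would be a durable store."""

    def __init__(self) -> None:
        self._charged: Dict[str, int] = {}  # order_id -> amount charged

    def charge_once(self, order_id: str, amount: int) -> bool:
        """Charge an order at most once.

        Returns True if this call performed the charge, and False if the
        order was already charged (a replayed or retried message).
        """
        if order_id in self._charged:
            return False
        self._charged[order_id] = amount
        return True

    def total(self) -> int:
        """Sum of all amounts charged so far."""
        return sum(self._charged.values())
```

Because the check and the write are keyed on the order id, replaying a message leaves the ledger unchanged. With concurrent consumers, the check and the write would need to happen atomically in the backing store.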

Q&A [5 min]

Notes

This is a two-person talk. My co-presenter’s details are as follows:

Name: Sourabh Ghorpade
Email: sourabh.ghorpade@gmail.com
Bio: Sourabh is a developer at GO-JEK and a recent recruit in the Clojure camp. He enjoys refactoring systems to scale better. He also feels writing about himself in the third person is rather narcissistic, but oh well.
Shirt Size: M
Organization: GO-JEK