Resiliency in Distributed Systems

By Rajeev N Bharshetty

Elevator Pitch

Keeping distributed systems up and running at scale is hard. Adopting a few basic patterns can guard our systems against sudden traffic spikes, dependency failures, network issues, and slow downstream services, and help us achieve high uptime.

Description

Running distributed systems with high uptime is hard. Faults are inevitable in a complex distributed environment with many moving parts. Systems need to be designed from the start to be resilient against the common faults seen in live production systems at scale, such as sudden surges in traffic, bad or failed dependencies, network outages, and hosts going down.

To safeguard against these failures and the business loss they can cause, we discuss some basic patterns for designing resilient distributed systems at scale, such as Circuit Breakers, Bulkheads, Fallbacks, Redundancies, Metrics, and Monitoring.

This talk is for everyone who is interested in building highly reliable distributed systems in Go and who hates answering pagers at 3 am.

Notes

We will cover the following questions and topics in the talk:

  • What challenges do distributed systems pose at scale?
  • Why is it important to build resiliency into your systems?
  • How do we achieve resiliency in complex distributed systems?
  • The difference between faults and failures
  • Proactive and reactive measures for making systems reliable
  • Patterns for achieving highly reliable systems in Go, such as Circuit Breakers, Bulkheads, Connection Pooling, Timeouts and Retries, and Metrics collection

I will showcase most of the above patterns using an open source library my team and I wrote: Heimdall HTTP Client (https://github.com/gojek-engineering/heimdall)

I have worked on distributed systems at scale with high uptime requirements at GO-JEK, and learnt most of these patterns the hard way. The talk will be the culmination of my experience building reliable systems at GO-JEK scale.