Chaos Experiments Under the Lens of AIOps

By Michele Dodič

Elevator Pitch

AIOps and Chaos Engineering are two concepts that are often kept separate. In this talk, we will discuss (and show you!) how combining both practices can significantly increase cyber resiliency while maintaining full end-to-end transparency and observability of your entire system.

Description

Imagine this: you’re a Site Reliability Engineer (SRE) at a tech giant, responsible for the overall health of a system running in prod. Numerous alerts, server crashes, Jira tickets, incidents and an avalanche of responsibilities that sometimes simply feel like a ticking time bomb: these are just some of the daily struggles an average SRE goes through. But why should it be like that? Well, it shouldn’t - thanks to a term coined by Gartner in 2016. AIOps, meet audience. Audience, meet AIOps.

Let’s extend this scenario. On top of all the above-mentioned issues, our poor SRE needs to watch out for potential security breaches and make sure nothing ever slips through the cracks. However, through proactive experimentation, continuous verification and improvement, they make sure that the system can withstand the turbulent and malicious times we’re living in. Do these notions ring any bells? They sure do! Chaos Engineering (CE), meet audience. Audience, meet Chaos Engineering.

What’s our angle, you’re wondering? AIOps and CE are two concepts that are often kept separate. In this talk, we will discuss (and show you!) how combining both practices can significantly increase cyber resiliency while maintaining full E2E transparency and observability of your entire system.

For this session, we have prepared and analyzed several use cases, followed the main principles, summarized best practices and prepared a live demo that combines CE and AIOps tools.

Above all, we are SREs. As such, during this session we will stay close to the SRE principles and best practices that we used to achieve our goals, e.g. reducing organizational silos, measuring everything, learning from failures and analyzing changes holistically. As we proceed with our talk, the audience will be able to identify how these relate to AIOps as well as CE and, finally, how it all ties together.

Notes

Who are we?

Michele Dodič & Francesco Sbaraglia - SRE tech leads in ASG, Accenture

We are an Accenture group of highly motivated SREs specializing in state-of-the-art Chaos Engineering practices. Our goal is to promote a growth mindset that embraces agility, DevOps and SRE as the ‘new normal’, and to establish a shared understanding of the cultural, business and technical aspects of the operations revolution taking place right here, right now. As SRE practitioners, we aim to fully embrace the blameless culture and accept failure as a means to learn and improve. As public speakers, we love sharing our knowledge and findings with everyone (SREcon Europe 2021, Conf42: Chaos Engineering 2022, Splunk .conf22 Las Vegas, DevOpsCon 2022 Berlin, data2day 2022, Conf42: DevOps 2023).