Chaos Engineering is a powerful practice that is changing the way software is designed, developed and operated.
What is Chaos Engineering?
Chaos Engineering provides us with the techniques and methods to identify potential failures before they affect our customers. In the end, Chaos Engineering aims to improve the stability and resilience of our systems.
Reliable software is a fundamental necessity in modern cloud applications and architectures. Increasingly distributed systems significantly increase the potential for unplanned downtime and unexpected failures.
Today's highly available systems are complex and cannot be understood at first glance. Applications are developed autonomously and may be redeployed at any time, so we must ensure that they can always find and access the resources and dependencies they require. Temporary failures due to a new deployment or an error must not lead to the failure of the application. Applications must be able to deal with disruptions and react to them with appropriate behavior.
You don't choose the moment, the moment chooses you! You only choose how prepared you are when it does.
~ Fire Chief Mike Burtch
There are unknown servers or components that can cause the entire system to fail. Nowadays we often have the feeling of being hunted by the very systems we build and operate!
How can we resolve the conflict between becoming faster and better on the one hand and not losing control on the other?
Our customers must always have the feeling and certainty that the system is in a stable condition. From a company's point of view, the loss of sales that could result from a failure is a direct effect. Another is possible damage to the reputation that the service enjoys with customers. If availability and behaviour are unsatisfactory, word gets around quickly in a networked world, and this also has a direct impact on sales. Once such a wave of indignation has spread on social media, companies can no longer react to it in an appropriate and controlled manner.
Besides the goal of developing more stable software, the social aspect of Chaos Engineering should not be underestimated. It's not about destroying something, but about bringing the right people together in one place and jointly pursuing the goal of stable and fault-tolerant software. Chaos Engineering must not be used to expose or blame colleagues for mistakes or wrong decisions made while developing software. The goal is to improve software and provide the best possible service to customers. Even if things are on fire in the background, this should not have a serious impact on the customer.
In recent years, Netflix has been one of the drivers behind Chaos Engineering and has contributed significantly to the growing importance of Chaos Engineering in distributed systems. Kyle Kingsbury, security researcher, takes a slightly different approach, verifying the promises of manufacturers of distributed databases, queues and other distributed systems. With his tool Jepsen, he probes the behavior of the aforementioned systems and occasionally comes to frightening conclusions. You can find a very impressive talk on this topic on YouTube.
In this article I hope to give you a simple and descriptive introduction to the world of Chaos Engineering. As a neat side effect, Chaos Engineering will allow you to personally meet all of your colleagues within a short time - whether you want to or not! (But only if you do it wrong.)
When we develop new or existing software, we toughen our implementation through various forms of tests. We are often referred to a test pyramid that illustrates what kinds of tests we should write and to what extent.
When creating unit tests, we write test cases to check the expected behavior. The component we are testing is isolated from all its dependencies, and we keep their behavior under control with the help of mocks. These types of tests cannot guarantee that the component is free of errors. If the developer of the module made a logic error in the implementation of the component, this error will also occur in the tests - regardless of whether the developer implemented the tests first and then the code. One way to address this is Extreme Programming, in which the developers continuously alternate between writing the tests and implementing the functionality.
In order to allow the developers and stakeholders to spend more free time and relaxed weekends with family and friends, we write integration tests after the unit tests. These test the interaction of individual components. Integration tests are ideally run automatically once the unit tests have passed, and they test interdependent components.
During a system test, a fully integrated system is tested to ensure that it complies with its requirements. Usually the test is performed in a test environment with test data. The test environment should simulate the customer's production environment, that is, be as similar to it as possible.
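To make the idea of isolating a component with mocks concrete, here is a minimal sketch using Python's standard `unittest.mock`. The names `OrderService`, `payment_gateway` and `charge` are hypothetical and exist only for illustration.

```python
from unittest.mock import Mock


class OrderService:
    """Hypothetical component under test; it depends on a payment gateway."""

    def __init__(self, payment_gateway):
        self.payment_gateway = payment_gateway

    def place_order(self, amount):
        # The order is only confirmed if the gateway accepts the charge.
        if self.payment_gateway.charge(amount):
            return "confirmed"
        return "rejected"


def test_order_is_confirmed_when_payment_succeeds():
    gateway = Mock()
    gateway.charge.return_value = True  # the mock dictates the dependency's behavior
    assert OrderService(gateway).place_order(42) == "confirmed"
    gateway.charge.assert_called_once_with(42)


def test_order_is_rejected_when_payment_fails():
    gateway = Mock()
    gateway.charge.return_value = False
    assert OrderService(gateway).place_order(42) == "rejected"
```

Note that the mocked gateway only ever does what the test author imagined: it never times out and never answers with garbage. Whatever the author did not think of stays untested.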
Testing Types, Techniques and Tactics
If you go deeper into the world of software testing, you will find an almost endless list of types, techniques and tactics.
Thanks to high test coverage and automation, we achieve a very stable state of our application, but who does not know this unpleasant feeling on the way to the most beautiful place in the world?
What I mean is production, where our software has to show how good it really is. Only under real conditions can we see how all the individual components of the overall architecture behave. This unpleasant feeling has even been reinforced by the use of modern microservice architectures.
What makes Chaos Engineering different and how can it help me solve my problems in production?
Chaos Engineering 101
Before you start your first chaos experiments, make sure that your services already apply resilience patterns and can deal with the errors you are about to provoke.
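As an illustration of what such a resilience pattern can look like, here is a minimal sketch of a retry with fallback. It is only one of several possible patterns, and the function name `call_with_retry` and its parameters are assumptions made for this example.

```python
import time


def call_with_retry(operation, fallback, attempts=3, delay_seconds=0.0):
    """Try `operation` up to `attempts` times, then return `fallback()`."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            # Wait before the next attempt, but not after the last one.
            if attempt < attempts - 1:
                time.sleep(delay_seconds)
    # Every attempt failed: degrade gracefully instead of crashing.
    return fallback()
```

A caller could, for example, pass a function that queries a remote service as `operation` and one that returns a cached value as `fallback`. A chaos experiment then checks whether this safety net actually holds when the remote service fails.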
Chaos Engineering doesn't cause problems. It reveals them.
~ Nora Jones
As Nora rightfully points out, Chaos Engineering is not about creating chaos, but about preventing it. So, if you want to begin your chaos experiments, start small and be aware of the implications.
Rules of Chaos Engineering
Once again, it makes no sense to commence engineering chaos if your infrastructure - and especially your services - are not prepared for it. Heed this very important instruction, and we'll now begin our journey into Chaos Engineering.
- Talk to your colleagues about the planned chaos experiments in advance!
- If you know your chaos experiment will fail, don't do it!
- Chaos shouldn't come as a surprise; your aim is to prove a hypothesis.
- Chaos Engineering helps you understand your distributed systems better.
- Limit the blast radius of your chaos experiments.
- Always be in control of the situation during the chaos experiment!
Levels of Chaos Engineering
Chaos engineering takes place on many different levels, which I always like to divide into hard and soft levels.
The first thing you think of when you have a failure is the hardware: a server, router or other important physical component breaks down. With suitable tools we are able to simulate such failures or even cause them.
In the soft level we carry out experiments to review our organization, our team, processes and practices. We form hypotheses about how long it takes our monitoring to detect a failure, whether the appropriate processes to resolve it are started, and whether everyone involved is up to the situation.
It is essential to define metrics which give you a reliable statement about the overall state of your system. These metrics must be continuously monitored during the chaos experiments. As a nice side effect, you can also monitor these metrics outside of your experiments.
Metrics can be either technical or business metrics - I'd say that business metrics outweigh technical metrics. Netflix monitors the number of successful clicks to start a video during a chaos experiment; this is their core metric and it comes from the business domain. Customers not being able to start videos has a direct effect on customer satisfaction.
For example, if you run an online shop, the number of successful orders or the number of articles placed in the shopping basket would be important business metrics.
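A steady-state check on such a metric can be very simple. The following sketch compares an observed value against a baseline with a relative tolerance. The function name, the 5 % default and the example numbers are assumptions for illustration, not recommendations.

```python
def within_steady_state(baseline, observed, tolerance=0.05):
    """Return True if `observed` deviates from `baseline` by at most `tolerance` (relative)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return abs(observed - baseline) / baseline <= tolerance


# Hypothetical numbers: successful orders per minute before and during an experiment.
print(within_steady_state(baseline=200, observed=194))  # 3 % deviation
print(within_steady_state(baseline=200, observed=150))  # 25 % deviation
```

With these numbers the first check passes and the second fails, which would be the signal to stop the experiment.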
Think in advance about what should happen, and then prove it through your experiment. If your hypothesis is invalidated, you must locate the error based on the findings and bring it up with your team or company. This is sometimes the hardest part - definitely avoid finger-pointing and scapegoating! As a chaos engineer, your goal is to understand how the system behaves and to present this knowledge to the developers. This is why it's important to get everyone on board early and to let them participate in your experiments.
What awaits us in real life? What new mistakes could happen, and which ones already ruined our previous weekends? These and other questions must be asked and tested for in a controlled experiment.
Potential examples include:
- Failure of a node in a Kafka cluster
- Dropped network packets
- Hardware errors
- Insufficient max-heap-size for the JVM
- Increased network latency
- Malformed responses
You can extend the list however you like, and it will always be closely linked to the application's architecture. Even if your application is not hosted at one of the well-known cloud providers, things will go wrong in your own company's data center. I strongly suspect you could tell me a thing or two about it!
To maintain control at all times during the execution of a chaos experiment, developers must set clear limits: the blast radius. It determines which components an experiment influences and which services are directly affected by the changes. Before the experiment is executed, it must be clear to everyone involved what will be changed and to what extent. If, during the experiment, a component outside the defined radius shows changed behavior, the experiment must be stopped and the analysis started. The resulting findings must be reported to the responsible teams and persons, the errors must be eliminated, and the experiment must then be re-run.
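Two of the failure modes listed above, increased latency and failing calls, can be simulated in application code with a thin wrapper. This is a minimal sketch under stated assumptions: the name `inject_faults` and its parameters are hypothetical, and real chaos tooling typically injects such faults at the network or infrastructure level rather than in Python.

```python
import random
import time


def inject_faults(func, latency_seconds=0.0, error_rate=0.0, rng=None):
    """Wrap `func` so that every call is delayed and a fraction of calls fail."""
    if rng is None:
        rng = random.Random()

    def wrapper(*args, **kwargs):
        time.sleep(latency_seconds)  # simulated network latency
        if rng.random() < error_rate:  # simulated failed request
            raise ConnectionError("injected fault")
        return func(*args, **kwargs)

    return wrapper
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible, which helps when an experiment has to be repeated after a fix.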
If the intended experiment is successful, the blast radius is extended and further components are influenced. For example, it is possible to have multiple instances of a scaled service respond with errors or a higher latency. The blast radius has no formally defined end, but at some point there is a logical limit: a size that a failure would never reach in production, or one to which the system cannot respond at all. The failure of a complete data center would be an example.
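The blast radius and the abort rule described above can be sketched in a few lines. The helper names `select_targets` and `must_abort`, and the idea of expressing the radius as a fraction of instances, are assumptions made for this example.

```python
def select_targets(instances, blast_radius):
    """Pick the first `blast_radius` fraction of instances, but at least one."""
    if not 0 < blast_radius <= 1:
        raise ValueError("blast_radius must be in (0, 1]")
    count = max(1, int(len(instances) * blast_radius))
    return instances[:count]


def must_abort(changed_components, targets):
    """Return True if any component outside the blast radius changed its behavior."""
    return any(component not in targets for component in changed_components)


instances = ["shop-1", "shop-2", "shop-3", "shop-4"]  # hypothetical service instances
targets = select_targets(instances, blast_radius=0.25)
print(targets)                                   # a single instance for the first run
print(must_abort(["shop-1", "shop-3"], targets))  # an untargeted instance changed too
```

Extending the experiment from run to run then simply means raising `blast_radius` step by step once the previous run has succeeded.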
Cycle & Continuous Verification
The procedure during a chaos experiment is always the same.
- Think about which scenarios you want to check and develop an experiment.
- Define the appropriate Steady State to detect at any time if your experiment gets out of control.
- Formulate your Hypothesis and prove it with the help of your experiment.
- Always start with a small Blast Radius and extend it from run to run of your experiment.
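The cycle above can be condensed into a small function. This is only a sketch: `run_experiment` and its three callables (`steady_state_ok`, `inject`, `rollback`) are hypothetical names, and a real experiment would observe the steady state continuously rather than just once.

```python
def run_experiment(steady_state_ok, inject, rollback):
    """Run one chaos experiment and report the outcome as a string."""
    if not steady_state_ok():
        return "skipped"  # never start from an unhealthy system
    inject()
    try:
        # Hypothesis: the steady state holds even while the fault is active.
        return "hypothesis held" if steady_state_ok() else "hypothesis refuted"
    finally:
        rollback()  # always undo the fault, whatever happened
```

Placing the rollback in the `finally` branch keeps you in control of the situation even if the steady-state check itself raises an error.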
Now it starts and you execute your first Chaos Experiment. Observe yourself as you execute...
If there are no failures, we are bored and frustrated!
~ Benjamin Wilms
Seriously, you will notice this behavior in yourself: we are only satisfied once we have actually discovered mistakes that can then no longer blow up in production later on.
Your systems are constantly changing: new versions are going into production, hardware is being replaced, firewall rules are adapted and servers are restarted. The elegant way is to establish the culture of Chaos Engineering in your company and in the minds of its people.
There are many advantages in doing Chaos Engineering, but I'll outline two which I think are the most important:
First, Chaos Engineering helps me to detect and eliminate technical debt and the so-called dark debt in my system. John Allspaw has written a very readable article about dark debt. After all, if there's one thing production is good at, it's knowing when we're on vacation, planning a weekend at the ocean, or wanting to go out with friends in the evening.
Secondly, Chaos Engineering helps us to better understand the systems we build and operate and to regain confidence and trust. The complexity quickly makes us lose focus on our customers, for whom we want to deliver added value when they use our applications.
chaosmesh supports you in this process: it provides appropriate tools and lets you benefit from our long-standing experience in building complex distributed systems.