What Is Chaos Engineering and What Are Its Benefits?
What is Chaos Engineering? Who pioneered Chaos Engineering? What is the role of observability in Chaos Engineering? Why does Netflix use Chaos Engineering? What are the benefits of Chaos Engineering?
Looking for answers to such questions? Keep reading.
What is Chaos Engineering?
Chaos Engineering is a method of testing the reliability of a software system by deliberately injecting failures into it. Through controlled experiments, it examines how well the system keeps functioning in the face of an unexpected disturbance or problem.
By using Chaos Engineering, an organization can identify where it needs backup components or functions that keep the software running during unexpected problems.
Who pioneered Chaos Engineering?
In 2008, Netflix suffered a major database corruption that disrupted its service for several days, after which the streaming giant decided to move to the cloud. After migrating to the AWS cloud infrastructure, Netflix engineers realized that no single component could guarantee 100 percent uptime.
However, with so many processes running at once, it was difficult to test the resilience of a large-scale, cloud-based distributed system. Netflix turned to Chaos Engineering to test different variables and components without impacting the end user.
Netflix ran its first Chaos Engineering experiments by randomly terminating production instances to ensure that the entire system would not collapse when specific services failed.
What is Chaos Monkey?
Inspired by the idea of a wild monkey let loose in a data center, randomly destroying instances and chewing through cables, Netflix developed Chaos Monkey.
Chaos Monkey was a first-of-its-kind tool for checking the recoverability of Netflix's web services infrastructure.
It simulates failures by randomly terminating instances, which helps organizations and software developers prepare for unexpected situations.
What is the Simian Army?
The Simian Army comprises open-source cloud testing tools that allow developers to test the resilience, security, recoverability, and reliability of cloud services.
After the development of Chaos Monkey, Netflix engineers started developing more autonomous software agents for Chaos Engineering. Thus, they developed the Simian Army.
The Simian Army includes Latency Monkey, Conformity Monkey, Security Monkey, Janitor Monkey, Doctor Monkey, and Chaos Monkey.
- Latency Monkey simulates service degradation to see how upstream services react.
- Conformity Monkey identifies and shuts down instances that do not adhere to best practices, giving their owners the opportunity to relaunch them properly.
- Security Monkey checks DRM and SSL certificates for expiry. It also terminates any instances that do not conform to security standards.
- Janitor Monkey searches for unused resources and disposes of them.
- Doctor Monkey checks the health of cloud instances and monitors external signs of health such as CPU load and memory usage.
- Chaos Monkey randomly terminates different instances to see how service shutdown affects the overall system.
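The random termination that Chaos Monkey performs can be illustrated with a minimal Python sketch. It does not use Netflix's tool or any cloud API. The `Instance` class, the `terminate_random_instance` function, and the `service_is_available` rule are hypothetical names invented for this illustration, and the fleet is a plain in-memory list.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    """Hypothetical stand-in for a cloud instance (not a real cloud API object)."""
    name: str
    running: bool = True


def terminate_random_instance(fleet, rng):
    """Stop one randomly chosen running instance and return it (None if none run)."""
    running = [inst for inst in fleet if inst.running]
    if not running:
        return None
    victim = rng.choice(running)
    victim.running = False
    return victim


def service_is_available(fleet, min_running=1):
    """Assumed health rule: the service is up while min_running instances run."""
    return sum(inst.running for inst in fleet) >= min_running


fleet = [Instance(f"web-{i}") for i in range(3)]
rng = random.Random(42)  # seeded so the sketch is repeatable
victim = terminate_random_instance(fleet, rng)
print(f"terminated {victim.name}; service available: {service_is_available(fleet)}")
```

In a real setup the termination would target actual instances and the availability check would be a real health probe. The point of the sketch is only the shape of the experiment: remove one instance at random, then verify that the service as a whole survives.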
What is the difference between Chaos Engineering and testing?
Most software applications go through traditional testing, which feeds the application a set of known inputs and checks whether it produces the predicted outputs. If it does not, the software developer fixes the application until it does.
Unlike traditional testing, Chaos Engineering uses experiments and unusual combinations of conditions to test software applications and systems. By doing this, organizations widen the scope of testing and check how the software performs in the face of an unexpected situation.
What is the role of observability in Chaos Engineering?
Observability is the ability to understand the internal state of a software system by analyzing its external outputs. It digs into the different failure modes of a system and uses insights from those modes to create new failsafe iterations.
Observability in Chaos Engineering enables faster deployments, helps prioritize business KPIs, and helps develop system auto-healing, among other things. In addition, observability correlates monitoring, logging, tracing, and data aggregation to troubleshoot problems and find solutions.
Organizations can use artificial intelligence and machine learning, along with regression analysis, time series analysis, and trend analysis, to build definitive observability patterns and antipatterns.
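As a small illustration of the trend analysis mentioned above, the following Python sketch fits a least-squares line to a series of latency samples and reports its slope. The sample values and the `trend_slope` function name are assumptions made up for this example. A real observability pipeline would read such values from its monitoring system.

```python
def trend_slope(values):
    """Least-squares slope of values plotted against their index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples to compute a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator


# Hypothetical latency samples in milliseconds, collected during an experiment.
latencies_ms = [100, 102, 101, 105, 110, 118, 130]
slope = trend_slope(latencies_ms)
print(f"latency trend: {slope:.2f} ms per sample")
```

A clearly positive slope during a chaos experiment suggests that the injected fault is degrading the system over time rather than being absorbed by it.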
How does Chaos Engineering work?
Chaos Engineering consists of four steps.
- Hypothesis
  The first step is the hypothesis, in which engineers consider what can happen to the state of the application when a variable changes. Forming a hypothesis lets chaos engineers ask questions and write down their assumptions. Later, they compare these assumptions with what actually happens.
- Testing
  In the testing phase, chaos engineers use a simulated environment along with load testing to check for changes in services, infrastructure, network, and devices. If the results differ from the assumptions, the chaos engineers restructure or rebuild the component.
- Blast radius
  The extent of the damage an experiment can cause is known as the blast radius. Chaos engineers define and limit the blast radius when testing specific variables and components.
- Insights
  Insights are the results gathered from the hypothesis, testing, and blast radius steps. Using these insights, chaos engineers can restructure and rebuild components so that they perform better during unexpected situations.
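The four steps above can be sketched as a small experiment runner in Python. This is a minimal sketch under stated assumptions, not a definitive implementation: `ExperimentResult`, `run_experiment`, and the three-replica system are hypothetical, and the rule that the service is healthy while at least two replicas are up is invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class ExperimentResult:
    hypothesis: str
    blast_radius: int        # number of targets the fault was applied to
    steady_state_held: bool  # True if the hypothesis survived the fault


def run_experiment(
    hypothesis: str,
    targets: List[str],
    blast_radius: int,
    inject_fault: Callable[[str], None],
    restore: Callable[[str], None],
    steady_state: Callable[[], bool],
) -> ExperimentResult:
    """Inject a fault into at most blast_radius targets, check, then restore."""
    affected = targets[:blast_radius]
    for target in affected:
        inject_fault(target)
    try:
        held = steady_state()
    finally:
        # Always undo the fault, even if the steady-state check raises.
        for target in affected:
            restore(target)
    return ExperimentResult(hypothesis, len(affected), held)


# Hypothetical system: three replicas, healthy while at least two are up.
replicas = {"a": True, "b": True, "c": True}


def stop(name: str) -> None:
    replicas[name] = False


def start(name: str) -> None:
    replicas[name] = True


def healthy() -> bool:
    return sum(replicas.values()) >= 2


result = run_experiment(
    hypothesis="The service stays healthy when one replica is down",
    targets=list(replicas),
    blast_radius=1,
    inject_fault=stop,
    restore=start,
    steady_state=healthy,
)
print(result)
```

The returned `ExperimentResult` plays the role of the insights step: it records which hypothesis was tested, how wide the blast radius was, and whether the assumption held.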
Types of experiments in Chaos Engineering
Here are the types of experiments in Chaos Engineering.
- Dependency testing
  Chaos engineers often assume a happy-path scenario for the software after standard tests pass. However, this assumption sometimes backfires, especially when there are many dependencies.
  Therefore, chaos engineers must conduct thorough tests and check for hidden dependencies between microservices, Redis, databases, Memcached, and downstream services. By doing such tests and checks, they can understand the challenges that may cause failure in the production and post-production stages.
- Inject failure
  Injecting a failure, or anything else that makes your software behave differently, is essential to Chaos Engineering. With this experiment, engineers can discover weak or vulnerable components of the software and build something that keeps the software running when a particular component malfunctions.
- Automate faults
  After coming across different faults while checking the reliability of the system, engineers use site reliability engineering practices to try to fix faults automatically. With such automation, they learn which automatic fixes work and for which functions they need to build backup components.
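To make the "inject failure" experiment concrete, here is a minimal Python sketch that wraps a dependency call so it fails with a chosen probability, and a caller that falls back to a default when the dependency fails. All names (`with_fault_injection`, `fetch_recommendations`, `get_recommendations`, `InjectedFailure`) are hypothetical and exist only for this illustration.

```python
import random


class InjectedFailure(Exception):
    """Raised by the fault injector in place of a real dependency error."""


def with_fault_injection(func, failure_rate, rng):
    """Wrap func so that each call fails with probability failure_rate."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise InjectedFailure(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


def fetch_recommendations(user_id):
    """Hypothetical downstream call that normally succeeds."""
    return [f"personalized-item-for-{user_id}"]


def get_recommendations(user_id, fetch):
    """Fall back to a static list when the dependency fails."""
    try:
        return fetch(user_id)
    except InjectedFailure:
        return ["popular-item"]


# A failure rate of 1.0 makes every call fail, so the fallback is always used.
always_failing = with_fault_injection(
    fetch_recommendations, failure_rate=1.0, rng=random.Random(0)
)
print(get_recommendations("u1", always_failing))  # prints ['popular-item']
```

The fallback list is the kind of backup component the experiment is meant to justify: if the caller had no `except` branch, the injected failure would surface the weakness immediately.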
What are the benefits of Chaos Engineering?
The benefits of Chaos Engineering include:
- Promotes innovation
  Chaos Engineering promotes innovation by identifying design and structural flaws in the software system. The intelligence gathered from understanding these flaws helps improve new and existing components.
- Greater collaboration
  Chaos Engineering facilitates greater collaboration, as the insights gathered are not limited to chaos engineers but are shared across departments.
- Streamlines incident response
  Incident response is important for applications that need to run all the time. By testing variables and components in advance, Chaos Engineering helps streamline troubleshooting, repairs, and incident response.
- Boosts business
  Organizations that use Chaos Engineering can build resilient and reliable systems that increase customer satisfaction. Software that fails less often can, in turn, increase demand for the business.
How to set up a Chaos Engineering culture and what is a game day?
To set up a Chaos Engineering culture, start by holding a game day. A game day is a day dedicated to running Chaos Engineering experiments on software and computer systems. On a game day, simulate an environment of failure. Then, check how your team and your systems respond to different types of failure.
How to plan and run a game day?
Here are the steps to plan and run a game day:
- Compile a list of all failure scenarios
  List all the variables and components that can break down or malfunction. Some common questions you can ask are: Can your systems support 15x the current load? What will happen if your servers run out of disk space? How will your system respond to a DDoS attack?
  Answering all of these questions in a single game day can be difficult. So, rank the questions by the impact they have on your software, and distribute them across different days.
- Create a series of hypotheses
  After you have selected the failure scenarios, it is time to create a series of hypotheses. While doing so, make sure that you create a step-by-step process for each hypothesis.
  A detailed hypothesis with possible outcomes will help you measure the predicted outcomes against the real outcomes and build your next strategy effectively.
- Check how your team reacts
  A game day is not only about testing for failure scenarios but also about preventing them. Organizations must see how different teams react while running experiments and fixing problems.
  Teams that do not communicate well and take longer than expected to fix a problem should receive communication and collaboration training. By doing this, you can ensure that they are ready when a real failure arises.
- Address discovered gaps
  The final step of the game day does not have to happen on the game day itself but must happen soon after. Chaos engineers must address the errors and gaps with a short-term solution to ensure that the applications keep working as expected.
  Also, chaos engineers must prepare detailed plans to rebuild or replace components and variables that trigger failure.
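The first planning step, ranking failure scenarios by impact and spreading them over several game days, can be sketched in Python as follows. The `Scenario` class, the `plan_game_days` function, and the impact scores are assumptions made up for the example. In practice the team would assign the scores.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    question: str
    impact: int  # assumed scale: 1 (low) to 5 (high), chosen by the team


def plan_game_days(scenarios, per_day):
    """Sort scenarios by impact (highest first) and split them into game days."""
    ranked = sorted(scenarios, key=lambda s: s.impact, reverse=True)
    return [ranked[i:i + per_day] for i in range(0, len(ranked), per_day)]


# The questions come from the list above; the impact scores are invented.
scenarios = [
    Scenario("Can the system support 15x the current load?", impact=4),
    Scenario("What happens if servers run out of disk space?", impact=3),
    Scenario("How does the system respond to a DDoS attack?", impact=5),
]

for day, batch in enumerate(plan_game_days(scenarios, per_day=2), start=1):
    print(f"Game day {day}: {[s.question for s in batch]}")
```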
What are some of the common failure scenarios in Chaos Engineering?
Some common failure scenarios in Chaos Engineering include:
- Disk space overuse
  The goal of overusing disk space is to see whether your software sends an alert upon meeting a certain threshold. If you don’t see an alert, you must fix the issue right away.
- EC2 shutdown
  Amazon EC2 provides the virtual servers (instances) on which applications run. By forcing an EC2 instance to shut down, you can check whether the software application keeps working or has lost data that was held only on that instance.
- Load balancer adjusting
  Load balancers play a crucial role in distributing incoming traffic from users to back-end servers. By taking one back-end server out of service at a time, you can check whether your load balancer still routes requests correctly.
- Swap security groups
  Security groups configure network security by specifying the protocols, ports, and IP addresses over which you can send traffic. By swapping an instance's security group for a different one, you can check for changes in application behavior when expected traffic is blocked.
- Force CPU spikes
  To check how much load a software application can handle without breaking down, force CPU spikes, for example by running a CPU-intensive workload. With this action, you will know how much load your system can effectively handle.
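For the disk space scenario in the list above, the alert condition itself is easy to sketch. The following Python example uses the standard library's `shutil.disk_usage` to read usage for a path and compares it against a threshold. The 90 percent threshold and the function names `should_alert` and `check_path` are assumptions for illustration, and a real system would send the alert through its monitoring tool rather than print it.

```python
import shutil


def should_alert(used_bytes, total_bytes, threshold=0.9):
    """Return True when disk usage meets or exceeds the threshold (assumed 90%)."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return used_bytes / total_bytes >= threshold


def check_path(path, threshold=0.9):
    """Read real disk usage for path and apply the alert rule."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return should_alert(usage.used, usage.total, threshold)


print(should_alert(95, 100))  # prints True: 95% used is above the threshold
print(check_path("."))        # depends on the machine running the sketch
```

During the experiment, you would fill the disk past the threshold and confirm that the production alerting rule fires in the same way this check does.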
What tools can I use for Chaos Engineering?
There is a wide variety of tools available for implementing Chaos Engineering. They include:
- Chaos Mesh
  Chaos Mesh offers a dedicated dashboard with several built-in experiments and timeframes to inject chaos into your software systems. Also, you can design custom experiments and conduct status checks of different components and development stages.
- Chaos Monkey
  Chaos Monkey can help you detect system bottlenecks so that you can resolve them. Additionally, the open-source tool terminates instances and gives an account of the resulting failures.
- Litmus
  Litmus helps you carry out controlled chaos tests in the production stage. Also, it allows you to implement log capturing, generate reports, detect bugs, and run test suites.
- Gremlin
  Gremlin offers three attack modes and various failure scenarios to help build software resiliency and reliability. Also, it offers features like latency injection, CLI support, memory leak testing, and disk fill-ups.
Summary
Organizations should use Chaos Engineering to understand the types of failures that can occur while their software applications are in use. By running chaos experiments and testing failure scenarios, organizations can better prepare for negative outcomes and build reliable software applications.
FAQs
- How do you control the blast radius in Chaos Engineering?
  To keep control of the blast radius, introduce one fault at a time. Also, have a rollback plan in place in case the outcome is not as expected.
- What is Chaos Gorilla?
  Chaos Gorilla simulates the outage of an entire AWS Availability Zone to check the impact on users. By taking all the instances in that zone offline, Chaos Gorilla checks how the remaining systems respond.
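The advice in the first answer, one fault at a time with a rollback plan, can be sketched with a Python context manager. The `single_fault` function and the `settings` dictionary are hypothetical and stand in for whatever configuration or resource a real experiment would change.

```python
from contextlib import contextmanager


@contextmanager
def single_fault(config, key, faulty_value):
    """Apply exactly one fault to config and always roll it back on exit."""
    original = config[key]
    config[key] = faulty_value
    try:
        yield config
    finally:
        config[key] = original  # rollback runs even if the experiment raises


# Hypothetical application settings used only for this illustration.
settings = {"db_timeout_s": 5, "cache_enabled": True}

with single_fault(settings, "cache_enabled", False) as faulty:
    print("during experiment:", faulty)

print("after rollback:", settings)
```

Because the rollback sits in a `finally` clause, the original value is restored whether the experiment finishes normally or fails partway through.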