Release the chaos monkey
As organisations scale, and both business and technical landscapes evolve, the ability to adapt and respond to change is crucial for product teams. By embracing the unexpected and preparing for potential disruptions, organisations can build resilience and foster collaboration across various departments.
This is where chaos days come into their own. One or more people within an organisation - the chaos monkeys - are seconded to intentionally and systematically sabotage a product or service on a given day, so that teams can assess how they react in the face of adversity.
Drawing inspiration from Netflix's Chaos Monkey, which randomly terminates virtual machines to ensure the resilience of their systems, this approach for product teams seeks to build robustness and adaptability by intentionally introducing chaos and uncertainty.
And crucially, the chaos monkey approach is not just for engineering. It is a health check for your process, comms, and alignment as a team, and your all-round observability, as much as it is a test of your tech stack and your ability to actually resolve problems.
Chaos days are particularly impactful in start-ups and scale-ups in the build-up to significant product launches and milestones, and also when looking to embed or evolve a dedicated support function within an organisation.
The chaos monkey approach revolves around three key stages:
- Planning and preparation
- The chaos day itself
- Reflection and adaptation
Planning and preparation
To successfully unleash the chaos monkey in your organisation, thorough planning and preparation are key. Begin by assigning one or more dedicated people who can organise and execute the chaos day. I’ve found DevOps engineers and principal engineers / architects tend to make the best chaos monkeys, but this is by no means a prerequisite. It will help if they know the product or service well, and have (or can easily get) access to the supporting infrastructure.
It’s often difficult for teams themselves to block out time for chaos days, and to be disciplined enough to run them regularly. Where I’ve seen organisations execute them really well and get the most out of them, they’ve been organised and orchestrated on behalf of teams.
My recommended frequency is quarterly, but even if you only manage to do a one-off chaos day, I’d still strongly recommend it.
It’s usually wise for chaos monkeys to ask themselves the following questions:
- Is this a good day for chaos? It might be wise to defer if the team is already flat out working towards an impending deadline, for example.
- How broad do we want to go? Do we want to keep this to a single team’s domain, or do we want to try and break the whole organisation? The former is simpler to execute. The latter can yield greater learnings, particularly cross-team.
- How will we execute? A conversation with the tech lead of the relevant team is usually enough to outline rules of engagement. Will it all happen on a staging environment? Does the chaos monkey have access to all of the tooling needed to inflict the desired level of chaos?
- Will the chaos monkey give the team half an hour to detect an issue themselves before reporting it, in an attempt to battle-test their observability capability?
- Who needs to know about the chaos day? I’ve seen C-suite members panic on the back of a chaos monkey outage because they thought it was a genuine production issue. It’s good to communicate widely and often.
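If it helps to make the answers concrete, they can be written down as a small rules-of-engagement record. What follows is only a sketch in Python: the `RulesOfEngagement` class, its field names and the `should_report` helper are all made up for illustration, and the half-hour default simply mirrors the detection window mentioned above.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class RulesOfEngagement:
    """Hypothetical record of what was agreed with the tech lead."""
    scope: str                      # e.g. "single team" or "whole organisation"
    environment: str                # e.g. "staging"
    detection_window: timedelta = timedelta(minutes=30)  # grace period before the monkey speaks up
    notify: list[str] = field(default_factory=list)      # people who must know it is a drill

    def should_report(self, injected_at: datetime, now: datetime,
                      detected_by_team: bool) -> bool:
        """True once the window has passed and the team still has not spotted the issue."""
        return (not detected_by_team) and (now - injected_at >= self.detection_window)


rules = RulesOfEngagement(scope="single team", environment="staging",
                          notify=["CTO", "support lead"])
injected = datetime(2024, 1, 1, 10, 0)
print(rules.should_report(injected, injected + timedelta(minutes=10), False))
print(rules.should_report(injected, injected + timedelta(minutes=35), False))
```

The first call falls inside the window, so the chaos monkey stays quiet. The second is past it, so the issue gets reported and the day can move on.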
The chaos day itself
During the chaos day, the chaos monkey will introduce unexpected challenges and disruptions across the organisation. Potential disruptions could include:
- Modifying infrastructure
- Committing breaking changes to codebases
- Spoofing attacks
- Spoofing user or stakeholder issues
- Tooling unavailability
The aim of any chaos day is to test the adaptability and resilience of cross-functional teams, compelling them to think creatively, prioritise effectively, and collaborate efficiently to overcome the challenges presented.
The team is expected to treat the incident as if it were a production incident, so it is not uncommon for people outside of the team to also be involved.
A few tips for budding chaos monkeys:
- Create your own runbook ahead of time. Have five to ten disruptions planned, with steps to reproduce. Introduce them in dev beforehand if you can without giving the game away, to give you confidence that you can execute easily on the day. You can even go further and list out some clues you could also drop to help the team diagnose an incident.
- Don’t overdo it! If a team is struggling with one issue, keep your next one in your back pocket rather than hitting them with multiple at the same time.
- Sense when to put the team out of their misery. Some disruptions can go under the radar for many different reasons. If this happens, feel free to move on to the next one to try and make the most of the day.
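The runbook doesn’t need to be anything fancy, and a document or spreadsheet works just as well. As one possible sketch, here it is as a small Python structure. The `Disruption` class, `check_runbook` and `next_disruption` are hypothetical names invented for this example, and the checks simply encode the tips above.

```python
from dataclasses import dataclass, field


@dataclass
class Disruption:
    """One planned disruption in the chaos monkey's runbook (hypothetical structure)."""
    name: str
    steps: list[str]                                 # steps to reproduce
    clues: list[str] = field(default_factory=list)   # hints to drop if the team gets stuck
    rehearsed_in_dev: bool = False


def check_runbook(runbook: list[Disruption]) -> list[str]:
    """Return warnings where the runbook does not follow the tips above."""
    warnings = []
    if not 5 <= len(runbook) <= 10:
        warnings.append(f"{len(runbook)} disruptions planned; five to ten is the suggested range")
    for d in runbook:
        if not d.steps:
            warnings.append(f"'{d.name}' has no steps to reproduce")
        if not d.rehearsed_in_dev:
            warnings.append(f"'{d.name}' has not been rehearsed in dev")
    return warnings


def next_disruption(runbook: list[Disruption], done: set[str], team_busy: bool):
    """Pick the next unused disruption, holding back while the team is still on an issue."""
    if team_busy:
        return None  # don't overdo it: one issue at a time
    for d in runbook:
        if d.name not in done:
            return d
    return None  # runbook exhausted


runbook = [
    Disruption("database unavailable", ["stop the staging database"],
               clues=["check the health dashboard"], rehearsed_in_dev=True),
    Disruption("breaking commit", []),
]
for warning in check_runbook(runbook):
    print(warning)
```

Keeping `done` and `team_busy` as plain inputs leaves the judgement calls, such as when to move on from an undetected disruption, with the chaos monkey rather than the script.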
And a few suggestions for the lucky ones who are about to welcome the chaos monkey into their domain for the day:
- Don’t over-prepare. Perhaps be that little bit more vigilant on checking your observability dashboards, but there’s no need for any dedicated prep. In fact, it’s better you do none, to make it as close to reality as possible.
- Go easy on yourselves. On every chaos day I’ve seen, at least one of the items has gone undiscovered. That’s the whole point, so don’t beat yourself up about it. This isn’t a test that you can pass or fail.
- Don’t fake it. If you’re, say, the product manager in the team, don’t force yourself into the conversation if you wouldn’t usually be in the mix in real life. It’s easy to get swept up in the fun and excitement of a chaos day, but it’s crucial that the incidents are reacted to as they would be on production. This isn’t an opportunity to be a hero for a day.
Reflection and adaptation
Following the chaos day, reflection and adaptation are crucial. You can capture the learnings from the experience through:
- Conducting a thorough debrief session with all team members involved. It’s good for the chaos monkey to conduct this, and play back their perspective on how things went.
- Identifying key learnings, areas for improvement, and opportunities for growth.
- Documenting the outcomes and insights gained from the chaos day.
- Implementing changes and adaptations to enhance the resilience and adaptability of cross-functional teams.
Benefits of the chaos monkey approach
Embracing the chaos monkey can lead to many benefits for cross-functional teams, including:
- Improved adaptability and resilience to unexpected challenges.
- Enhanced teamwork, collaboration, and communication skills across the whole org.
- A greater understanding of the strengths and weaknesses of cross-functional teams.
- The ability to identify and address potential vulnerabilities before they escalate into critical issues.
Plus, they're really fun! And they set a great precedent of being on the front foot - being able to proactively set some time aside away from the grind to improve as a team.
In summary
In an industry where change is constant and the unexpected can happen at any moment, the chaos monkey approach offers a proactive method for bolstering the resilience and adaptability of cross-functional teams and the wider organisation.
By embracing chaos and uncertainty, organisations can ensure that they are better prepared to face the challenges of tomorrow.
The next time you're looking to strengthen your product teams, consider giving the chaos monkey a call.