Excellence in Tech Operations
What is operational excellence?
Operational excellence is a mindset embraced across an organisation to maximise outcomes and positive results. It is a crucial aspect of successful engineering teams.
Achieving operational excellence requires a focused approach to identifying and eliminating waste, reducing errors, and optimising workflows.
This blog post will discuss how our teams achieved operational excellence by using key techniques.
At Deriv, our teams used these techniques when launching a new service to replace an existing one that was a critical part of the user journey. By applying them, the teams launched the new service without impacting clients or causing any downtime, ensuring a smooth transition for users.
Had the service in question suffered downtime, it would have impacted 70% of our user base directly or indirectly.
Techniques
Documentation
The team documented the requirements as a first step. Documentation plays a vital role in clarifying the purpose of the change/feature to be implemented.
It helped us to create a shared understanding of the problem and ensured that everyone on the team was aligned on the goals and objectives. It also helped to prevent rework or wasted effort, as we had a clear understanding of what was expected from the beginning.
Deploy first, release later philosophy
The "deploy first, release later" philosophy is a technique that we adopted to ensure that any new code we develop is deployed to our production environment as soon as it is ready.
However, we do not immediately release it to our customers. This allows us to ensure that the code is stable and functioning correctly before we make it available to our users.
It also lets us catch issues early and address them before the feature reaches our users.
Finally, it allows us to smoke-test the new feature in production without the load that comes with a full feature release.
Deploy != Release
Implementing Feature Flags
Feature flags are an essential part of our development process. They allow us to toggle specific features on and off during runtime without releasing new code. This gives us the flexibility to test new features and experiment with different options without affecting the user experience.
By implementing feature flags, we tested the new feature internally before releasing it to a subset of users. We gradually increased the number of users as we gained confidence in the feature's stability.
This also provided us with a fallback to the old implementation in case of any issue with the newly deployed service.
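To make the gradual rollout and the fallback more concrete, here is a minimal Python sketch. It is not our production code: the flag name `new_service`, the functions `old_service` and `new_service`, and the hash-based bucketing are all assumptions made for illustration. Each user is mapped to a stable bucket between 0 and 99, the flag is on for users whose bucket is below the current rollout percentage, and any error in the new path falls back to the old implementation.

```python
import hashlib


def rollout_bucket(user_id: str, flag_name: str) -> int:
    """Map a user to a stable bucket in [0, 100) for a given flag."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def is_enabled(user_id: str, flag_name: str, rollout_percent: int,
               allow_list: frozenset = frozenset()) -> bool:
    """Return True if the flag is on for this user.

    allow_list holds hypothetical internal testers who always get the feature.
    """
    if user_id in allow_list:
        return True
    return rollout_bucket(user_id, flag_name) < rollout_percent


def old_service(user_id: str) -> str:
    # Stand-in for the existing implementation (hypothetical).
    return f"old:{user_id}"


def new_service(user_id: str) -> str:
    # Stand-in for the newly deployed implementation (hypothetical).
    return f"new:{user_id}"


def handle_request(user_id: str, rollout_percent: int,
                   allow_list: frozenset = frozenset(),
                   new_impl=new_service, old_impl=old_service) -> str:
    """Route to the new service when the flag is on, else to the old one."""
    if is_enabled(user_id, "new_service", rollout_percent, allow_list):
        try:
            return new_impl(user_id)
        except Exception:
            # Fallback: any failure in the new path is served by the old one.
            return old_impl(user_id)
    return old_impl(user_id)
```

Because the bucket depends only on the flag name and the user ID, a user who sees the new service at 10% keeps seeing it when the percentage is raised to 50%. Setting the percentage back to 0 sends everyone outside the allow list to the old implementation without a new deployment.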
Observability and operation scripts
Having observability and operation scripts ready before release is essential for ensuring a smooth deployment process.
We created custom operational scripts that mimic user behaviour to test the functionality and to automate common operations tasks. These scripts enable us to quickly detect and respond to issues, reducing downtime and improving the overall customer experience.
We also added a few important metrics to our monitoring system (system CPU and memory usage, request success and failure counts, and so on) and enabled logging so that the new service was observable.
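As an illustration of what such an operational script could look like, here is a minimal Python sketch. Everything in it is hypothetical: `fake_service` stands in for a real client call, and `RequestMetrics` is a simple in-memory counter rather than a real monitoring system. The script replays a list of user-like payloads against the service and records success and failure counts together with latencies.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class RequestMetrics:
    """In-memory request counters; a real setup would export these instead."""
    success: int = 0
    failure: int = 0
    latencies_ms: List[float] = field(default_factory=list)

    def record(self, ok: bool, latency_ms: float) -> None:
        if ok:
            self.success += 1
        else:
            self.failure += 1
        self.latencies_ms.append(latency_ms)

    @property
    def failure_rate(self) -> float:
        total = self.success + self.failure
        return self.failure / total if total else 0.0


def run_smoke_test(call: Callable[[Dict], Dict], payloads: Iterable[Dict],
                   metrics: RequestMetrics) -> RequestMetrics:
    """Replay user-like payloads against `call` and record the outcomes."""
    for payload in payloads:
        start = time.perf_counter()
        try:
            call(payload)
            ok = True
        except Exception:
            ok = False
        metrics.record(ok, (time.perf_counter() - start) * 1000)
    return metrics


def fake_service(payload: Dict) -> Dict:
    # Hypothetical stand-in for the new service: rejects requests without a user.
    if payload.get("user_id") is None:
        raise ValueError("missing user_id")
    return {"status": "ok"}
```

Running `run_smoke_test(fake_service, payloads, RequestMetrics())` right after a deployment gives a quick signal on whether the basic user flows work, and an alert can be derived from `failure_rate`.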
Implementing parallel run
A parallel run is the practice of keeping an old system operational after launching a new one: both systems run concurrently for some time, until there is enough assurance that the new system is reliable and effective. A reconciliation process generally accompanies it to validate the data.
A parallel run is best suited to read-only calls, that is, calls that don't change state.
We implemented a parallel run that sent read-only user requests to both the existing and the new service. This practice uncovered a few minor implementation issues where the functionality of the two services differed.
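A parallel run with reconciliation can be sketched in a few lines of Python. The sketch assumes that requests are read-only and that responses can be compared with `==`. The `primary` and `shadow` callables and the `mismatches` list are hypothetical names used for illustration, not part of our actual services. The caller always receives the response of the old (primary) service, while the response of the new (shadow) service is only compared and logged.

```python
import logging
from typing import Any, Callable, List, Tuple

logger = logging.getLogger("parallel_run")


def parallel_read(request: Any,
                  primary: Callable[[Any], Any],
                  shadow: Callable[[Any], Any],
                  mismatches: List[Tuple[Any, Any, Any]]) -> Any:
    """Serve a read-only request from `primary` and compare it with `shadow`.

    Differences and shadow failures are appended to `mismatches` for later
    reconciliation; they never affect the response returned to the caller.
    """
    primary_response = primary(request)
    try:
        shadow_response = shadow(request)
    except Exception as exc:
        # A failing new service must not impact the user.
        mismatches.append((request, primary_response, repr(exc)))
        logger.warning("shadow call failed for %r: %r", request, exc)
        return primary_response
    if shadow_response != primary_response:
        mismatches.append((request, primary_response, shadow_response))
        logger.warning("mismatch for %r: %r != %r",
                       request, primary_response, shadow_response)
    return primary_response
```

This sketch calls the two services one after the other for simplicity, so the shadow call adds latency to every request. In practice it may be preferable to run the shadow call asynchronously or on a sample of traffic.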
Did we face issues?
Oh, yes! We faced a few issues. Luckily, they were all internal rather than client-facing, thanks to the techniques above.
A few issues we encountered:
- We faced network issues between the calling service and the new service during our testing.
- We missed one mapping issue in the code, resulting in errors in a valid response.
- Even after implementing all of these techniques, we had one memory issue that forced us to fall back to the old service for some time (load testing would have helped).
Conclusion
Achieving operational excellence is an ongoing process, requiring a continuous focus on identifying and eliminating waste, reducing errors, and optimising workflows. By adopting a deploy first, release later philosophy, implementing feature flags, having observability and operation scripts ready before release, and implementing parallel runs when needed, our team has been able to streamline our development process and deliver a better customer experience. These techniques have enabled us to reduce downtime, catch issues early in the development process, and experiment with new features, all while maintaining a stable and reliable system.
Key takeaways
- Operational excellence is about mindset as much as it is about techniques.
- Documenting before implementation saves a lot of time in the long run.
- Use parallel run with caution, preferably for read-only operations.
- Lastly, don't follow the above practices blindly; use your judgment.