Excellence in Tech Operations

March 18, 2024

This article was updated on March 18, 2024

What is operational excellence?

Operational excellence is a mindset, embraced across an organisation, that aims to maximise positive outcomes. It is a crucial aspect of successful engineering teams.

Achieving operational excellence requires a focused approach to identifying and eliminating waste, reducing errors, and optimising workflows.

This blog post discusses the key techniques our teams used to achieve operational excellence.

At Deriv, our teams used these techniques when launching a new service to replace an existing one that was a critical part of the user journey. By applying them, the teams launched the new service without impacting clients or causing any downtime, ensuring a smooth transition for users.

Had the service in question gone down, it would have affected 70% of our user base, directly or indirectly.

Techniques

Documentation

The team documented the requirements as a first step. Documentation plays a vital role in clarifying the purpose of the change/feature to be implemented.

It helped us to create a shared understanding of the problem and ensured that everyone on the team was aligned on the goals and objectives. It also helped to prevent rework or wasted effort, as we had a clear understanding of what was expected from the beginning.

Deploy first, release later philosophy

The "deploy first, release later" philosophy is a technique that we adopted to ensure that any new code we develop is deployed to our production environment as soon as it is ready.
However, we do not immediately release it to our customers. This allows us to ensure that the code is stable and functioning correctly before we make it available to our users.
It also enables us to catch any issues early in the development process and address them before releasing the feature to our users.

It allows us to smoke-test the new feature in production without the load associated with a new feature release.

Deploy != Release

Implementing Feature Flags

Feature flags are an essential part of our development process. They allow us to toggle specific features on and off at runtime without deploying new code. This gives us the flexibility to test new features and experiment with different options without affecting the user experience.

By implementing feature flags, we tested the new feature internally before releasing it to a subset of users. We gradually increased the number of users as we gained confidence in the feature's stability.

This also provided us with a fallback to the old implementation in case of any issue with the newly deployed service.
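
To make the gradual rollout and the fallback concrete, here is a minimal sketch in Python. It is not our production code: the `FLAGS` store, the flag name `new_service`, and the `old_service`/`new_service` functions are all hypothetical, and it assumes users are assigned to a rollout bucket by hashing their ID.

```python
import hashlib

# Hypothetical in-memory flag store. In practice this would live in a
# config service so it can be changed at runtime without a deploy.
FLAGS = {"new_service": {"enabled": True, "rollout_percent": 10}}


def bucket(user_id: str) -> int:
    """Map a user ID to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag: str, user_id: str) -> bool:
    """Return True if the flag is on and the user falls inside the rollout."""
    cfg = FLAGS.get(flag)
    if not cfg or not cfg["enabled"]:
        return False
    return bucket(user_id) < cfg["rollout_percent"]


def old_service(user_id: str) -> dict:
    # Hypothetical stand-in for the existing implementation.
    return {"source": "old", "user": user_id}


def new_service(user_id: str) -> dict:
    # Hypothetical stand-in for the newly deployed implementation.
    return {"source": "new", "user": user_id}


def handle_request(user_id: str, new_impl=new_service, old_impl=old_service) -> dict:
    """Route to the new service for enabled users, falling back on error."""
    if is_enabled("new_service", user_id):
        try:
            return new_impl(user_id)
        except Exception:
            # Any failure in the new path falls back to the old one.
            return old_impl(user_id)
    return old_impl(user_id)
```

Because the bucket depends only on the user ID, raising `rollout_percent` from 10 to 50 keeps the first 10% of users on the new path and adds more, rather than reshuffling everyone. Setting `enabled` to `False` sends all traffic back to the old implementation.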

Observability and operation scripts

Having observability and operation scripts ready before release is essential for ensuring a smooth deployment process.

We created custom operational scripts that mimic user behaviour to test functionality and that automate common operational tasks. These scripts let us detect and respond to issues quickly, reducing downtime and improving the overall customer experience.
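
As an illustration of what such a script can look like, here is a small sketch in Python. The `FakeClient` class and its `login` and `get_profile` methods are hypothetical stand-ins, not our service's API; a real script would call the deployed service instead.

```python
class FakeClient:
    """Hypothetical stand-in for a client of the service under test."""

    def login(self, user: str) -> str:
        return f"token-{user}"

    def get_profile(self, token: str) -> dict:
        if not token.startswith("token-"):
            raise PermissionError("invalid token")
        return {"user": token[len("token-"):]}


def expect(condition: bool, message: str) -> None:
    """Raise if a check's expectation does not hold."""
    if not condition:
        raise RuntimeError(message)


def build_checks(client):
    """Return (name, callable) pairs that walk through a simple user journey."""

    def login_returns_token():
        expect(bool(client.login("smoke-user")), "empty token")

    def profile_matches_user():
        token = client.login("smoke-user")
        profile = client.get_profile(token)
        expect(profile["user"] == "smoke-user", f"unexpected profile: {profile}")

    return [("login", login_returns_token), ("profile", profile_matches_user)]


def run_checks(checks) -> dict:
    """Run every check, recording 'ok' or the failure instead of stopping early."""
    results = {}
    for name, check in checks:
        try:
            check()
            results[name] = "ok"
        except Exception as exc:
            results[name] = f"failed: {exc!r}"
    return results


if __name__ == "__main__":
    for name, outcome in run_checks(build_checks(FakeClient())).items():
        print(f"{name}: {outcome}")
```

Running every check even after one fails gives a fuller picture of what is broken, which is usually what you want from a script run right after a deploy.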

We also added a few important metrics (system CPU and memory usage, request success and failure counts, and so on) to our monitoring system, and enabled logging so that the new service was observable.
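
One simple way to get request success and failure counts is to wrap each handler, as in the sketch below. It is an assumption-laden example: the `METRICS` counter, the `observed` decorator, and the `lookup` handler are hypothetical names, and a real setup would export the counts to a monitoring system rather than keep them in memory. CPU and memory usage are usually collected by a host-level agent and are not shown here.

```python
import logging
import time
from collections import Counter
from functools import wraps

logger = logging.getLogger("new_service")

# Hypothetical in-process metric store, keyed by "<name>.success" / "<name>.failure".
METRICS = Counter()


def observed(name: str):
    """Decorator that counts successes and failures and logs each call."""

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                result = func(*args, **kwargs)
            except Exception:
                METRICS[f"{name}.failure"] += 1
                logger.exception("%s failed", name)
                raise  # Re-raise so callers still see the error.
            elapsed_ms = (time.perf_counter() - start) * 1000
            METRICS[f"{name}.success"] += 1
            logger.info("%s succeeded in %.1f ms", name, elapsed_ms)
            return result

        return wrapper

    return decorator


@observed("lookup")
def lookup(user_id: str) -> dict:
    # Hypothetical handler: rejects empty IDs, otherwise echoes the user.
    if not user_id:
        raise ValueError("user_id is required")
    return {"user": user_id}
```

The decorator re-raises after counting, so adding observability does not change how errors propagate to the caller.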

Implementing parallel run

A parallel run (also called parallel operation) is the practice of keeping an old system operational after launching a new one. Both systems run concurrently for some time, until there is enough assurance that the new system is reliable and effective. A reconciliation process generally accompanies it to validate the data.

A parallel run is best suited to read-only calls, that is, calls that don't change state, since sending a state-changing request to both systems could apply the change twice.

We implemented a parallel run to send read-only user requests to both the existing and the new service. By adopting this practice, we uncovered a few minor implementation issues where the two services behaved differently.
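
The idea can be sketched as follows. This is a simplified illustration, not our implementation: `parallel_read`, `MISMATCHES`, and the two example services are hypothetical, and the calls are made one after the other for clarity, whereas a real system would more likely call the new service asynchronously so it adds no latency.

```python
import logging
from typing import Any, Callable, List

logger = logging.getLogger("parallel_run")

# Hypothetical in-memory record of differences, reviewed during reconciliation.
MISMATCHES: List[dict] = []


def parallel_read(
    request: Any,
    old_impl: Callable[[Any], Any],
    new_impl: Callable[[Any], Any],
) -> Any:
    """Serve a read-only request from the old service and compare the new one."""
    primary = old_impl(request)  # The user always gets the old service's answer.
    try:
        shadow = new_impl(request)
    except Exception as exc:
        # A failure in the new service is recorded but never reaches the user.
        MISMATCHES.append({"request": request, "error": repr(exc)})
        logger.warning("new service failed for %r: %r", request, exc)
        return primary
    if shadow != primary:
        MISMATCHES.append({"request": request, "old": primary, "new": shadow})
        logger.warning("mismatch for %r: old=%r new=%r", request, primary, shadow)
    return primary


def old_status(user_id: str) -> dict:
    # Hypothetical existing service.
    return {"user": user_id, "status": "active"}


def new_status(user_id: str) -> dict:
    # Hypothetical new service with a field-mapping difference.
    return {"user": user_id, "status": "ACTIVE"}
```

Calling `parallel_read("u1", old_status, new_status)` returns the old service's response and records the difference in `status`, which is the kind of discrepancy a reconciliation step would then review.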

Did we face issues?

Oh, yes! We faced a few issues. Luckily, thanks to the techniques above, they were all internal rather than client-facing.

A few issues we encountered:

We faced network issues between the calling service and the new service during our testing.
We missed one mapping issue in the code, which produced errors in an otherwise valid response.
Even with all of this in place, we hit one memory issue that forced us to fall back to the old service for some time (load testing would have helped).

Conclusion

Achieving operational excellence is an ongoing process, requiring a continuous focus on identifying and eliminating waste, reducing errors, and optimising workflows. By adopting a deploy first, release later philosophy, implementing feature flags, having observability and operation scripts ready before release, and implementing parallel runs when needed, our team has been able to streamline our development process and deliver a better customer experience. These techniques have enabled us to reduce downtime, catch issues early in the development process, and experiment with new features, all while maintaining a stable and reliable system.