Excellence in Tech Operations

March 18, 2024

This article was updated on March 18, 2024

What is operational excellence?

Operational excellence is a mindset embraced across an organisation to maximise outcomes and positive results. It is a crucial aspect of successful engineering teams.

Achieving operational excellence requires a focused approach to identifying and eliminating waste, reducing errors, and optimising workflows.

This blog post discusses the key techniques our teams used to achieve operational excellence.

At Deriv, our teams used these techniques when launching a new service to replace an existing one that was a critical part of the user journey. As a result, the new service launched with no downtime and no impact on clients, giving users a smooth transition.

Had the service in question gone down, it would have affected 70% of our user base, directly or indirectly.

Techniques

Documentation

The team's first step was to document the requirements. Documentation plays a vital role in clarifying the purpose of the change or feature being implemented.

It helped us to create a shared understanding of the problem and ensured that everyone on the team was aligned on the goals and objectives. It also helped to prevent rework or wasted effort, as we had a clear understanding of what was expected from the beginning.

Deploy first, release later philosophy

"Deploy first, release later" means that new code is deployed to our production environment as soon as it is ready, but is not immediately released to our customers. This lets us confirm that the code is stable and functioning correctly in production before users see it, and catch and fix any issues before the feature is released.

It allows us to smoke-test the new feature in production without the load associated with a new feature release.

Deploy != Release

Implementing feature flags

Feature flags are an essential part of our development process. They allow us to toggle specific features on and off at runtime without deploying new code. This gives us the flexibility to test new features and experiment with different options without affecting the user experience.

By implementing feature flags, we tested the new feature internally before releasing it to a subset of users. We gradually increased the number of users as we gained confidence in the feature's stability.

This also provided us with a fallback to the old implementation in case of any issue with the newly deployed service.
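A gradual rollout with a fallback can be sketched as below. This is an illustrative sketch, not the implementation described in this post: the names `bucket_for`, `is_enabled`, `handle_request`, `new_service` and `old_service` are hypothetical, and the rollout percentage is assumed to come from some store that can be changed at runtime.

```python
import hashlib


def bucket_for(user_id: str) -> int:
    """Map a user ID to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(user_id: str, rollout_percentage: int) -> bool:
    """Return True if the user falls inside the current rollout percentage."""
    return bucket_for(user_id) < rollout_percentage


def handle_request(user_id, rollout_percentage, new_service, old_service):
    """Route to the new service when the flag is on; fall back on any error."""
    if is_enabled(user_id, rollout_percentage):
        try:
            return new_service(user_id)
        except Exception:
            # Fallback path: the old implementation stays available.
            return old_service(user_id)
    return old_service(user_id)
```

Because the bucket is derived from a hash of the user ID, a user who is included at 10% stays included as the percentage grows, and setting the percentage to 0 turns the feature off for everyone without a new deployment.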

Observability and operation scripts

Having observability and operation scripts ready before release is essential for ensuring a smooth deployment process.

We created custom operational scripts that mimic user behaviour to test functionality and that automate common operations tasks. These scripts let us detect and respond to issues quickly, reducing downtime and improving the overall customer experience.
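A script that mimics a user journey might look roughly like the sketch below. The `run_smoke_test` helper and the `StepResult` type are hypothetical names chosen for illustration; in practice each step would call the real service rather than a local function.

```python
import time
from typing import Callable, List, NamedTuple, Tuple


class StepResult(NamedTuple):
    name: str
    ok: bool
    seconds: float
    detail: str


def run_smoke_test(steps: List[Tuple[str, Callable[[], object]]]) -> List[StepResult]:
    """Run each (name, action) step in order and record pass/fail and duration."""
    results = []
    for name, action in steps:
        start = time.monotonic()
        try:
            action()
            ok, detail = True, ""
        except Exception as exc:
            # A failed step is recorded; the remaining steps still run.
            ok, detail = False, f"{type(exc).__name__}: {exc}"
        results.append(StepResult(name, ok, time.monotonic() - start, detail))
    return results
```

A caller would pass a list of named steps (for example, hypothetical "log in" and "fetch balance" actions) and alert if any result has `ok` set to `False`.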

We also defined a few important metrics for our monitoring system, such as system CPU and memory usage and request success and failure counts, and enabled logging so that the new service was observable.
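Counting request successes and failures can be sketched with a small decorator. This is a minimal sketch using only the Python standard library: `request_counts` is a stand-in for a real metrics client, and the `observed` decorator and the metric names are assumptions made for illustration.

```python
import logging
from collections import Counter
from functools import wraps

logger = logging.getLogger("new_service")
request_counts = Counter()  # stand-in for a real metrics client


def observed(name):
    """Count successes and failures of the wrapped call and log failures."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            try:
                result = func(*args, **kwargs)
            except Exception:
                request_counts[f"{name}.failure"] += 1
                logger.exception("request %s failed", name)
                raise  # the caller still sees the original error
            request_counts[f"{name}.success"] += 1
            return result
        return wrapper
    return decorator
```

Wrapping each request handler this way gives the monitoring system a success count and a failure count per operation, which is enough to derive an error rate and alert on it.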

Implementing parallel run

A parallel run is the practice of keeping the old system operational after launching a new one, running both concurrently for some time until there is enough assurance that the new system is reliable and effective. It is generally accompanied by a reconciliation process to validate the data.

A parallel run is best suited to read-only calls, that is, calls that don't change state.

We implemented a parallel run that sent read-only user requests to both the existing and the new service. This uncovered a few minor implementation issues where the two services behaved differently.
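A parallel run with a simple reconciliation step can be sketched as follows. The sketch assumes that the client is always served from the old service while the new service's response is only compared; the post does not say which response was returned, and the names `parallel_read`, `old_service`, `new_service` and `mismatches` are hypothetical.

```python
import logging

logger = logging.getLogger("parallel_run")


def parallel_read(request, old_service, new_service, mismatches):
    """Serve a read-only request from the old service and shadow it to the new one."""
    primary = old_service(request)  # the client always gets the old result
    try:
        shadow = new_service(request)
    except Exception as exc:
        # A failure in the new service is recorded and never reaches the client.
        mismatches.append((request, primary, f"error: {exc}"))
        logger.warning("new service failed for %r: %s", request, exc)
        return primary
    if shadow != primary:
        # Reconciliation: keep both answers so the difference can be reviewed.
        mismatches.append((request, primary, shadow))
    return primary
```

For simplicity the sketch calls the two services one after the other, which adds the new service's latency to every request; a real setup would more likely issue the shadow call asynchronously. Restricting this to read-only requests matters because a duplicated write would change state twice.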

Did we face issues?

Oh, yes! We faced a few issues. Luckily, thanks to the techniques above, they were all internal and not client-facing.

A few issues we encountered:

We faced network issues between the calling service and the new service during our testing.
We missed one mapping issue in the code, which resulted in errors in an otherwise valid response.
Even after implementing all of this, we hit one memory issue that forced us to fall back to the old service for some time (load testing would have helped).

Conclusion

Achieving operational excellence is an ongoing process, requiring a continuous focus on identifying and eliminating waste, reducing errors, and optimising workflows. By adopting a deploy first, release later philosophy, implementing feature flags, having observability and operation scripts ready before release, and implementing parallel runs when needed, our team has been able to streamline our development process and deliver a better customer experience. These techniques have enabled us to reduce downtime, catch issues early in the development process, and experiment with new features, all while maintaining a stable and reliable system.

Key takeaways

Operational excellence is about mindset as much as it is about techniques.
Documenting before implementation saves a lot of time in the long run.
Use parallel run with caution, preferably for read-only operations.
Lastly, don't follow the above practices blindly; use your judgment.
