Data engineering at Deriv: Building robust infrastructure

This article was updated on September 19, 2024.

In today’s data-driven world, have you ever felt overwhelmed by the sheer volume of information or frustrated by the lack of a unified, accessible data infrastructure?

At Deriv, we understand these challenges firsthand. For an online trading company, data is the lifeblood that fuels the business. With a staggering 17 terabytes of data processed daily and over 600 users running hundreds of thousands of SQL queries monthly on analytics dashboards, the data engineering team plays a crucial role in building and maintaining the robust infrastructure that powers Deriv's data-driven operations.

Just like a bustling city relies on its infrastructure to function smoothly, Deriv’s data engineering team acts as the central hub, connecting various data sources and ensuring the timely delivery of information to the different departments.

This reliance on data extends to the very heart of Deriv’s operations. From analysing customer lifetime value and segmenting our customer base to uncovering long-term trends within our rich historical data, our data empowers us to make informed decisions. In this article, we delve into how our data engineering processes enable us to achieve fast and precise decision-making across the organisation.

Deriv’s data platform architecture

Deriv’s data platform is a well-designed ecosystem that integrates various data sources, including relational databases, APIs, and external files, into a centralised data warehouse. This architecture, visualised in the flowchart, allows for efficient data processing, storage, and consumption across the organisation.

Flowchart of Deriv's data platform architecture
Figure 1. Overview of data platform

The foundation of our data platform: Key data sources

Deriv's data ecosystem draws from a diverse array of sources, much like a city's various infrastructure networks. The data engineering team expertly manages the collection and integration of data from relational databases, APIs, and external files, ensuring a comprehensive and reliable data foundation.

Workflow orchestration: Leveraging Airflow DAGs

To automate and streamline the data pipelines, our data engineering team utilises Airflow, a powerful workflow orchestration tool. Think of it as a traffic control system for our data highways. Airflow’s Directed Acyclic Graphs (DAGs) help schedule and manage the execution of various data processing tasks, ensuring the timely delivery of data to the different teams. Just as traffic patterns vary throughout the day, some of our data tables update daily with product metrics, while others capture per-minute user login activities and even real-time trade data.
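For illustration, here is a minimal sketch of what such a DAG can look like. The DAG name, schedule, and task callables are hypothetical and are not taken from Deriv's actual pipelines.

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_from_replica(**context):
    """Pull the latest rows from a read replica (placeholder)."""
    ...


def load_to_warehouse(**context):
    """Load the extracted rows into the data warehouse (placeholder)."""
    ...


with DAG(
    dag_id="daily_product_metrics",   # hypothetical pipeline name
    schedule_interval="@daily",       # other pipelines run per minute or stream continuously
    start_date=datetime(2024, 1, 1),
    catchup=False,
    default_args={"retries": 2, "retry_delay": timedelta(minutes=5)},
) as dag:
    extract = PythonOperator(task_id="extract_from_replica", python_callable=extract_from_replica)
    load = PythonOperator(task_id="load_to_warehouse", python_callable=load_to_warehouse)

    extract >> load  # extraction must finish before loading starts
```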

Relational database management system

To minimise disruptions to the main business operations, the data engineering team leverages read replicas and configuration management tools when working with relational databases. This approach, illustrated in Figure 2, allows them to maintain data integrity and reliability while seamlessly integrating the data into the data warehouse. It’s akin to having a backup power grid that ensures continuous operations, even during maintenance or upgrades.

Illustration of configuration management
Figure 2. Illustration of configuration management
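To make the idea concrete, the snippet below is a minimal sketch that assumes connection details for a PostgreSQL read replica are rendered to a file by a configuration management tool; the file path and keys are hypothetical.

```python
import json

import psycopg2

# Connection details rendered by a configuration management tool
# (hypothetical path and keys).
with open("/etc/etl/replica.json") as f:
    cfg = json.load(f)

conn = psycopg2.connect(
    host=cfg["host"],      # the read replica, never the primary
    dbname=cfg["dbname"],
    user=cfg["user"],
    password=cfg["password"],
    # Guard against accidental writes from the ETL side.
    options="-c default_transaction_read_only=on",
)
```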

How we populate the data warehouse

Deriv uses two data warehouse systems: a centralised relational database and Google BigQuery. Before loading data into BigQuery, the data is typically processed in the centralised relational database, where transformations and aggregations are applied using SQL functions. This is like refining raw materials before sending them to a factory for further processing. 
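As a simplified example of this refinement step, the sketch below runs an aggregation with plain SQL inside the centralised relational database; the schema and table names are made up for illustration.

```python
import psycopg2

# Hypothetical aggregation: roll raw trades up into a daily summary table
# inside the centralised relational warehouse before it is shipped to BigQuery.
AGGREGATE_SQL = """
    INSERT INTO analytics.daily_trade_summary (trade_date, symbol, trade_count, total_volume)
    SELECT trade_date, symbol, COUNT(*), SUM(volume)
    FROM staging.trades
    WHERE trade_date = %(run_date)s
    GROUP BY trade_date, symbol;
"""

with psycopg2.connect("dbname=warehouse") as conn, conn.cursor() as cur:
    cur.execute(AGGREGATE_SQL, {"run_date": "2024-09-19"})
```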

In BigQuery, the team generates data marts from scheduled queries, which feed visualisation tools, regulatory reports, and machine learning models. These data marts can be thought of as specialised warehouses within the larger facility, each catering to specific needs.
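A data mart of this kind can be expressed as a single SQL statement. The sketch below, with hypothetical project, dataset, and table names, rebuilds one such mart; in production a statement like this would run as a BigQuery scheduled query rather than an ad-hoc job.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical mart that downstream dashboards and models could read from.
MART_SQL = """
CREATE OR REPLACE TABLE `example-project.marts.daily_active_users` AS
SELECT
  DATE(login_at) AS login_date,
  COUNT(DISTINCT user_id) AS active_users
FROM `example-project.raw.logins`
GROUP BY login_date
"""

client.query(MART_SQL).result()  # block until the mart refresh finishes
```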

Why we still need a relational database

The centralised relational database acts as an intermediary between data sources and BigQuery. By residing on the same network as the transactional database replicas, the team can conveniently access the data. The data is processed in batches, refined, and moved to corresponding schemas. It’s similar to a staging area where goods are sorted and prepared for shipment to their final destinations.

To achieve near real-time ingestion into the centralised data warehouse, the team uses a custom remote-data-access module to query data directly from the read replicas without source-side scripting. This allows for faster data updates, similar to express delivery services that bypass traditional shipping routes.
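Deriv's remote-data-access module is custom and not shown here, but the general pattern is an incremental pull keyed on a watermark column. The sketch below is a simplified, hypothetical version of that idea using psycopg2; hosts, tables, and columns are illustrative.

```python
import psycopg2

# Hypothetical hosts, databases, and columns.
replica = psycopg2.connect("host=replica.internal dbname=trading")
warehouse = psycopg2.connect("host=warehouse.internal dbname=warehouse")

with replica.cursor() as src, warehouse.cursor() as dst:
    # Find how far we have already loaded.
    dst.execute("SELECT COALESCE(MAX(created_at), 'epoch'::timestamptz) FROM staging.trades")
    watermark = dst.fetchone()[0]

    # Read only the new rows, directly from the read replica.
    src.execute(
        "SELECT id, symbol, volume, created_at FROM trades WHERE created_at > %s",
        (watermark,),
    )

    # Stage them in the centralised warehouse.
    dst.executemany(
        "INSERT INTO staging.trades (id, symbol, volume, created_at) VALUES (%s, %s, %s, %s)",
        src.fetchall(),
    )
    warehouse.commit()
```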

Google BigQuery: Our scalable cloud data warehouse

The next layer of the data warehouse is BigQuery. We chose it for the following reasons:

  • Serverless architecture: We don’t need to worry about resource management when executing queries. BigQuery’s columnar storage is optimised for analytical workloads, scanning only the columns a query touches rather than full rows. It’s like having a self-driving car that adapts to traffic conditions and optimises its route for maximum efficiency.

Table comparing record-oriented vs column-oriented storage
Figure 3. Record-oriented vs column-oriented storage, from the BigQuery docs
  • Easy permission management: We ensure each team only accesses relevant data by using a built-in access management service (a short sketch follows this list). This is similar to having secure access control systems in a building, granting entry only to authorised personnel.
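As a sketch of what dataset-level access control looks like with the BigQuery Python client (the dataset and group address are hypothetical):

```python
from google.cloud import bigquery

client = bigquery.Client()
dataset = client.get_dataset("example-project.marketing")  # hypothetical dataset

# Grant the marketing analytics group read-only access to its own dataset.
entries = list(dataset.access_entries)
entries.append(
    bigquery.AccessEntry(
        role="READER",
        entity_type="groupByEmail",
        entity_id="marketing-analytics@example.com",
    )
)
dataset.access_entries = entries
client.update_dataset(dataset, ["access_entries"])
```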

Data shipping: Multi-method approaches for BigQuery loading

We use various methods to load data into BigQuery:

  • Batch processing: A pipeline loads batches of staged files from cloud storage into BigQuery tables at scheduled intervals (see the sketch after this list). This is like scheduled cargo deliveries that arrive at set times.
  • Stream processing: We use a streaming API to process data in near real-time. This is like a live news feed that constantly updates with the latest information.
  • Event-based triggers: A sensor-based pipeline processes new files as they arrive in cloud storage buckets, applying transformations and kicking off the relevant pipeline logic. This is like a motion sensor that triggers an alarm when something new enters its field of view.
  • BigQuery external tables: Queries read data directly from files in external storage without loading them into BigQuery. This is like accessing files from a remote server.
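The first two methods map directly onto the BigQuery Python client: a load job for batches of staged files and the streaming API for near real-time rows. The bucket, project, and table names below are hypothetical, and both destination tables are assumed to already exist.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Batch processing: load one scheduled interval's staged files from cloud storage.
load_job = client.load_table_from_uri(
    "gs://example-staging-bucket/trades/2024-09-19/*.json",  # hypothetical bucket layout
    "example-project.raw.trades",                            # existing destination table
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    ),
)
load_job.result()  # wait for the batch load to finish

# Stream processing: push individual rows in near real-time.
errors = client.insert_rows_json(
    "example-project.raw.logins",
    [{"user_id": 123, "login_at": "2024-09-19T08:00:00Z"}],
)
if errors:
    raise RuntimeError(f"Streaming insert failed: {errors}")
```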

An illustration showing simplified dataflow to BigQuery
Figure 4. Simplified dataflow to BigQuery

Maximising value: Our dual BigQuery pricing model strategy

To optimise costs, Deriv utilises two BigQuery pricing models: On-Demand, which charges per query based on scanned bytes, and Capacity, which utilises reserved compute slots. 

The On-Demand model is used for data ingestion pipelines and scheduled queries, while the Capacity model is used for ad-hoc queries, dashboards, and reports. 

This separation ensures that ad-hoc reports do not interfere with the primary data pipeline execution and allows for better control of query costs. It's like having both a pay-as-you-go plan and a monthly subscription for your phone, depending on your usage patterns.

Data monitoring tools: Our approach to logging, alerting, and monitoring

Think of a health monitoring system that keeps track of your vital signs and alerts you to any potential problems. In a similar manner, we monitor the pipeline for issues such as server failures, data anomalies, or bugs, ensuring the reliability of our data and system. 

An illustration of alerting and monitoring data stacks
Figure 5. Alerting and monitoring stacks

Here are a few ways we monitor our system:

  • Customisable SQL dashboards to look out for data anomalies in business metrics. These dashboards are like the gauges on a car's dashboard, providing real-time information on the system's performance.
  • Log analytics to monitor instances and alert for any failed pipelines. This is similar to having a black box that records everything that happens in the system, allowing us to investigate any incidents.
  • Metadata management to keep track of data changes within our system. This is like having a version control system that tracks changes to documents over time.

We integrate with email and Slack providers to receive notifications from our alerting tools so we can be immediately informed of any critical issues and take swift action.
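As a simplified, hypothetical example of how such a check and notification can fit together (the metric, threshold, table, and webhook URL are all illustrative):

```python
import requests
from google.cloud import bigquery

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/T000/B000/XXXX"  # placeholder

client = bigquery.Client()

# Hypothetical anomaly check: compare yesterday's trade count with the prior 7-day average.
CHECK_SQL = """
SELECT
  SUM(IF(trade_date = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY), trade_count, NULL)) AS yesterday,
  AVG(IF(trade_date < DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY), trade_count, NULL)) AS weekly_avg
FROM `example-project.marts.daily_trade_summary`
WHERE trade_date >= DATE_SUB(CURRENT_DATE(), INTERVAL 8 DAY)
"""
row = next(iter(client.query(CHECK_SQL).result()))

if row.yesterday is not None and row.weekly_avg and row.yesterday < 0.5 * row.weekly_avg:
    requests.post(
        SLACK_WEBHOOK_URL,
        json={"text": f"Trade count anomaly: {row.yesterday} vs 7-day avg {row.weekly_avg:.0f}"},
        timeout=10,
    )
```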

Data quality: Maintaining data integrity across systems

Maintaining data quality is crucial for Deriv. The team has created a custom data quality pipeline to check for duplicates and ensure consistent data counts across data warehouses. In the centralised database, we use primary and unique key constraints, while for BigQuery, which lacks enforced primary keys, we created a metadata table defining primary key columns. Additionally, we audit data volumes across layers, ensuring that the record counts ingested into BigQuery match those in the first relational database layer for all tables on a daily basis.
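A duplicate check driven by such a metadata table might look like the sketch below; the dataset, table, and column names are hypothetical, and the metadata table is assumed to store each table's primary key columns as an array of strings.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical metadata table: one row per table with its primary key columns,
# mirroring the constraints enforced in the relational layer.
meta = client.query(
    "SELECT table_name, pk_columns FROM `example-project.meta.primary_keys`"
).result()

for row in meta:
    keys = ", ".join(row.pk_columns)
    dup_sql = f"""
        SELECT {keys}, COUNT(*) AS n
        FROM `example-project.raw.{row.table_name}`
        GROUP BY {keys}
        HAVING COUNT(*) > 1
    """
    duplicates = list(client.query(dup_sql).result())
    if duplicates:
        print(f"{row.table_name}: {len(duplicates)} duplicated primary key value(s)")
```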

Docker containerisation: Deploying key applications in containers

We deploy applications in a Docker swarm using container management software. This allows us to manage container deployments, networks, volumes, credentials, and scaling across instances. Updating a Docker service’s version only requires editing the docker-compose.yml file and redeploying, with easy rollbacks if needed. It's like having pre-fabricated building blocks that can be easily assembled and disassembled, allowing for flexible and efficient construction.
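For illustration, a stripped-down, hypothetical service definition in docker-compose.yml might look like this; bumping the image tag and redeploying the stack is the whole upgrade, and reverting the tag rolls it back.

```yaml
services:
  etl-scheduler:                             # hypothetical service name
    image: registry.example.com/etl:1.4.2    # bump this tag to upgrade, revert it to roll back
    networks:
      - data-platform
    secrets:
      - warehouse_password
    deploy:
      replicas: 2
      update_config:
        order: start-first                   # keep one replica running during updates

networks:
  data-platform:

secrets:
  warehouse_password:
    external: true
```

Redeployment is then a single `docker stack deploy -c docker-compose.yml data-platform` against the swarm.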

An illustration of Docker swarm
Figure 6. Docker swarm illustration from Docker docs

Security best practices: Ensuring secure data management

We maintain distinct staging and production environments, each with unique networking configurations. The staging environment enables us to test changes safely before deploying them to production, thereby reducing risk. Additionally, we implement two-factor authentication for accessing our services, providing an extra layer of security.

We regularly update our codebase and infrastructure in accordance with security best practices. A designated team audits for vulnerabilities, updating tools, rotating credentials, and decommissioning legacy systems. Data access adheres to the principle of least privilege within each department, with additional contexts, such as PII, requiring specific approval.

Data Engineering's role in shaping Deriv's data-driven future

Our Data Engineering team focuses on building a data infrastructure that integrates various data sources into the data warehouse. Logging, monitoring, and security best practices ensure data quality and reliability. 

While the details in this article may be subject to change, our fundamental commitment to facilitating fast and precise decision-making remains unchanged. We value your input and encourage you to share your thoughts on our strategy.

About the author

Fauzan Ragitya is a Data Engineer at Deriv with a passion for lifelong learning. He’s constantly expanding his expertise in data analytics and data infrastructure.