Database Workloads: A Song of Ice and Fire
Published: May 28, 2024

Databases have evolved to serve business workloads over the last 50 years. Back in 2005, however, Mike Stonebraker and Uğur Çetintemel described the previous 25 years of database development as “one size fits all” [1], referring to monolithic database architectures meant to serve every type of workload. The other extreme, of course, would be to build a new database for every new workload. Let us look at how the 20 years since the Stonebraker paper have fared between these two extremes.
2010: Big Data
The 2010s saw a huge wave of large-scale data analytics, inspired by Google’s data processing infrastructure and popularized by the open-source Hadoop data stack. The argument was that big data is no longer a database workload and cannot fit into traditional database architectures. Hadoop, in particular, was considered a completely new platform: one that could process unstructured or semi-structured data, run hand-written MapReduce programs, and scale massively to large numbers of machines. All of this, however, came at much lower performance and flexibility compared to databases.
While researchers were still debating whether MapReduce was a friend or foe of parallel DBMSs, practitioners were already busy deploying Hadoop into their data stacks. The main drivers for this active interest were ease of use and lower cost: people could run Hadoop directly on their existing files (or data lakes) and write simple imperative programs without being SQL experts, and there was no need to pay expensive DBMS license fees, build complex ingestion pipelines, or hire expert DBAs before the database could even run.
Over time, however, people realized that Hadoop and big data platforms resemble databases in more ways than they had imagined. Hadoop implements all the typical database operators, but they are hard-coded into a static execution plan. Making that plan flexible and allowing alternate implementations of those operators, just as databases do, can make Hadoop equally flexible and performant [2]. Furthermore, the query language can still be structured, with numerous extensions. This philosophy was espoused by efforts like Hive, LLAP, Impala, SCOPE, Spark, and BigQuery, among others.
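The difference between a hard-coded plan and a declarative one can be sketched in a few lines of plain Python. This is purely illustrative (real systems operate on distributed files, not in-memory lists), but it shows why the declarative form leaves an optimizer room to choose implementations:

```python
from collections import Counter, defaultdict
from itertools import chain

docs = ["ice and fire", "fire and ice", "ice"]

# MapReduce style: the scan -> map -> shuffle -> reduce pipeline is
# hard-coded; changing the execution strategy means rewriting this code.
def map_phase(doc):
    return [(word, 1) for word in doc.split()]

def reduce_phase(pairs):
    counts = defaultdict(int)
    for word, n in pairs:          # shuffle + reduce folded into one step
        counts[word] += n
    return dict(counts)

mr_counts = reduce_phase(chain.from_iterable(map_phase(d) for d in docs))

# Database style: the same aggregation stated declaratively; an optimizer
# is free to pick hash vs. sort aggregation, push filters, or parallelize.
db_counts = dict(Counter(word for doc in docs for word in doc.split()))

assert mr_counts == db_counts      # same answer, different degrees of freedom
```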
Today, big data platforms have morphed into modern cloud-native data warehousing platforms. They have all the elements of traditional database architecture, including SQL, query optimization, query processing, data layouts, partitioning, indexing, materialized views, and so on. They also support open formats to process data directly from the data lake, and provide a lot of tooling and automation, e.g., built-in partitioning, auto-scaling, and workload management, for a better user experience. Hadoop may not be deployed anymore, but the workloads inspired by Hadoop have been fully assimilated by database architectures.
2015: Graph Analytics
Graph analytics became a hot topic around 2015, popularized by new applications in social media, ecommerce, retail, transportation, recommendation systems, and web search. The core idea was that linked data is different and does not fit existing databases. It involves operations like graph traversals, shortest paths, spanning trees, and cliques over large graphs, which are hard to support in databases.
Similar to MapReduce, Google introduced another data processing framework, called Pregel, for running graph analytics in bulk synchronous parallel (BSP) fashion. The design gained popularity with an open-source implementation, called Giraph, on top of Hadoop, and a commercial implementation called GraphLab. Several other extensions continued developing these specialized graph systems further.
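To make the vertex-centric BSP model concrete, here is a toy single-source shortest-paths computation in plain Python. The graph, function names, and message-passing loop are illustrative assumptions, not Pregel’s actual API, but the structure (vertices update state, send messages to neighbors, and synchronize at a global barrier between supersteps) is the same:

```python
import math

# Toy directed graph with unit edge weights: vertex -> list of neighbors.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}

def pregel_sssp(graph, source):
    """Single-source shortest paths, vertex-centric, in BSP supersteps."""
    dist = {v: math.inf for v in graph}
    messages = {source: [0]}             # superstep 0: source receives 0
    while messages:                      # run until no vertex is active
        next_messages = {}
        for v, incoming in messages.items():
            best = min(incoming)
            if best < dist[v]:           # vertex updates its own state...
                dist[v] = best
                for u in graph[v]:       # ...and messages its neighbors
                    next_messages.setdefault(u, []).append(best + 1)
        messages = next_messages         # global barrier between supersteps
    return dist

# pregel_sssp(graph, "a") -> {"a": 0, "b": 1, "c": 1, "d": 2}
```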
Over time, as with Hadoop, people realized that graph analytics can also be supported in relational databases. Like Hadoop, Giraph has a static execution plan that processes parallel vertex computations in supersteps. This plan can be expressed as a database query execution plan and optimized using a combination of SQL and user-defined functions [3]. Furthermore, column-store layouts can perform fast self-joins to traverse the graph iteratively.
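The same traversal can be phrased relationally: a frontier joined back against the edge table once per iteration, which is exactly the self-join pattern the paragraph above describes. A minimal sketch using Python’s built-in sqlite3 and a recursive CTE (the table and data are made up for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (src TEXT, dst TEXT)")
conn.executemany("INSERT INTO edges VALUES (?, ?)",
                 [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")])

# Iterative traversal as repeated self-joins: each recursion step joins the
# current frontier against the edge table (one "superstep" per join).
rows = conn.execute("""
    WITH RECURSIVE reach(v, dist) AS (
        SELECT 'a', 0
        UNION
        SELECT e.dst, r.dist + 1
        FROM reach r JOIN edges e ON e.src = r.v
    )
    SELECT v, MIN(dist) FROM reach GROUP BY v ORDER BY v
""").fetchall()

# rows == [("a", 0), ("b", 1), ("c", 1), ("d", 2)]
```

Because the traversal is now an ordinary query plan, the database is free to pick join algorithms and layouts, which is where column stores shine.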
Today, many databases support graph analytics, including SQL Server, Oracle, Teradata, Spark, among others, and even specialized graph systems have a database-oriented architecture. Application developers can combine SQL and graph operations in the same database while still having the flexibility and performance of Pregel.
2020: Machine Learning, Data Science
Machine learning and data science became extremely hot topics by the 2020s. Once again, Google came up with TensorFlow to democratize ML platforms, and data science, dubbed the “sexiest job of the 21st century,” helped fuel the wave. ML workloads were seen as different from database workloads, needing a new set of tools for training and deployment.
The prevailing wisdom was to consider the end-to-end machine learning lifecycle and build new platforms spanning everything from feature engineering all the way to model tracking and inference. Data scientists predominantly operate in Python, which has also become one of the most popular languages, with a comprehensive ecosystem of libraries and toolchains. All of this was completely isolated from databases.
Yet again, people soon realized that machine learning and databases are better off being close to one another. On one hand, people started building database extensions to run ML workloads natively within a database, e.g., ML Services in SQL Server or SQL extensions in BigQuery. On the other hand, it became possible to push down data science programs written in Python into scalable database platforms [4]. These developments have led to better integrated architectures where ML and data scientists can work right on top of the data in their databases.
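The pushdown idea can be sketched in miniature: capture a dataframe-style expression lazily and compile it into one SQL query that runs in-database. Real systems like Magpie [4] are far more sophisticated; the `LazyFrame` class, its methods, and the data below are hypothetical, introduced only to show the shape of the technique:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("east", 10.0), ("west", 5.0), ("east", 7.5)])

class LazyFrame:
    """Toy dataframe that records operations instead of executing them."""
    def __init__(self, table):
        self.table, self.ops = table, []
    def groupby_sum(self, key, value):
        self.ops.append(("groupby_sum", key, value))
        return self
    def to_sql(self):
        # Pushdown: compile the captured ops into a single SQL query so the
        # computation runs where the data lives, not in the Python process.
        (_, key, value), = self.ops
        return (f"SELECT {key}, SUM({value}) FROM {self.table} "
                f"GROUP BY {key} ORDER BY {key}")

q = LazyFrame("sales").groupby_sum("region", "amount")
result = conn.execute(q.to_sql()).fetchall()
# result == [("east", 17.5), ("west", 5.0)]
```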
2025: Generative AI
Generative AI is changing how businesses operate. This time, it was not Google but OpenAI that had the Sputnik moment, with ChatGPT. For enterprises, 2023 was the year of exploration (what is possible), 2024 the year of evaluation (will it work for me), and 2025 is likely to be the year of the much-awaited evolution (making it real).
There is a general consensus that AI is only as good as the data. The current approach, however, is to treat data as a kitchen sink to be thrown into new generative AI systems: systems for model fine-tuning, vector databases for retrieval augmentation, or toolchains to feed massive prompts to LLMs. The prevailing belief is that none of this can happen within a database, and that we need to build completely new generative AI platforms. Interestingly, however, we already see many database vendors building their own in-situ vector indexes. So we know where this is going.
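What a vector index answers is simple to state: given a query embedding, return the most similar stored embeddings. Here is an exact brute-force sketch in plain Python (the documents and vectors are made up; production indexes such as HNSW answer the same query approximately, inside the database, at far larger scale):

```python
import math

# Toy embeddings stored alongside the rows they describe; a database would
# keep these in an indexed vector column rather than a Python dict.
embeddings = {
    "doc1": [1.0, 0.0, 0.0],
    "doc2": [0.9, 0.1, 0.0],
    "doc3": [0.0, 1.0, 0.0],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def top_k(query, k=2):
    """Exact nearest neighbors by cosine similarity; this linear scan is
    what an approximate vector index replaces for retrieval augmentation."""
    ranked = sorted(embeddings,
                    key=lambda d: cosine(query, embeddings[d]),
                    reverse=True)
    return ranked[:k]

# top_k([1.0, 0.05, 0.0]) -> ["doc1", "doc2"]
```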
Will generative AI be yet another workload that could be completely operated within a database? At Tursio, we believe so and are on a mission to turn databases into generative AI machines [5]. The upsides are obvious — data stays within the database, simplified architecture lowers cost and risk, accelerated time to develop and deploy, all using design principles that have been perfected over the last 50 years.
To conclude: does one size fit all? While the jury may still be out, databases have time and again emerged as a free size that can fit many.
References
[1] “One size fits all”: an idea whose time has come and gone, Michael Stonebraker and Uğur Çetintemel, International Conference on Data Engineering, 2005.
[2] Hadoop++: Making a Yellow Elephant Run Like a Cheetah (Without It Even Noticing), Jens Dittrich, Jorge-Arnulfo Quiane-Ruiz, Alekh Jindal, Yagiz Kargin, Vinay Setty, Jorg Schad, International Conference on Very Large Data Bases, 2010.
[3] Graph analytics using the Vertica relational database, Alekh Jindal, Samuel Madden, Malú Castellanos, Meichun Hsu, IEEE International Conference on Big Data, 2015.
[4] Magpie: Python at Speed and Scale using Cloud Backends, Alekh Jindal, Venkatesh Emani, Maureen Daum, Olga Poppe, Brandon Haynes, Anna Pavlenko, Ayushi Gupta, Karthik Ramachandra, Carlo Curino, Andreas Mueller, Wentao Wu, Hiren Patel, Conference on Innovative Data Systems Research, 2021.
[5] Turning Databases Into Generative AI Machines, Alekh Jindal, Shi Qiao, Sathwik Reddy Madhula, Kanupriya Raheja, Sandhya Jain, Conference on Innovative Data Systems Research, 2024.