Engineering

Where Is My Large Data Model?

Published: March 28, 2023

Alekh Jindal


This is a follow-up post on connecting advancements in data and AI. Check out the previous post here.


AI continues its rapid march, with OpenAI announcing app-store-style plugins for ChatGPT, Databricks showing how to build ChatGPT-like magic entirely with open source, and Microsoft claiming to have seen sparks of artificial general intelligence in GPT-4, all within the span of a single week. As someone working in data, I both admire and envy this incredible progress in large language models. In particular, I wonder where my large data model is to solve the many hard data problems that remain unsolved. Let’s explore this thought further below.


Traditional Data Models

Wikipedia defines a data model as an abstract model that organizes elements of data and standardizes how they relate to one another. Classical textbooks on data modeling describe how the definition and format of data are crucial for developing information systems that share data across applications. Typically, a data model is one of three types:


  1. Conceptual data model, as described by the semantics of the business.
  2. Logical data model, as described by the data processing system.
  3. Physical data model, as described by the physical data storage.

Databases also provide physical and logical data independence, wherein changes to the physical or logical schema or view of the data should not impact the applications, i.e., the applications do not need to be rewritten. Physical schema changes include modifications to how the data is stored, indexed, partitioned, etc. Logical schema changes include modifications to table or view definitions, e.g., changing how they are computed or materialized.
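As a minimal sketch of logical data independence (using SQLite in-memory and hypothetical table/view names), an application can keep querying a stable view while the underlying physical schema changes underneath it:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Physical schema, version 1: a single base table.
cur.execute("CREATE TABLE orders_v1 (id INTEGER, amount REAL)")
cur.executemany("INSERT INTO orders_v1 VALUES (?, ?)", [(1, 10.0), (2, 15.0)])

# Logical schema: the application only ever queries this view.
cur.execute("CREATE VIEW orders AS SELECT id, amount FROM orders_v1")

def app_total(cur):
    # Application code is written against the view, not the base table.
    return cur.execute("SELECT SUM(amount) FROM orders").fetchone()[0]

print(app_total(cur))  # 25.0

# Physical schema changes (e.g., storage is reorganized and units change):
# only the view definition is updated; app_total() is never rewritten.
cur.execute("CREATE TABLE orders_v2 (id INTEGER, amount_cents INTEGER)")
cur.executemany("INSERT INTO orders_v2 VALUES (?, ?)", [(1, 1000), (2, 1500)])
cur.execute("DROP VIEW orders")
cur.execute(
    "CREATE VIEW orders AS "
    "SELECT id, amount_cents / 100.0 AS amount FROM orders_v2"
)

print(app_total(cur))  # 25.0, unchanged from the application's perspective
```

The same separation is what lets a production database repartition or re-encode its storage without breaking downstream applications.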


Data Models in Modern Data Stack

Data modeling plays a critical role in the modern data stack, where data is quickly extracted and loaded into a central data processing system, e.g., a data cloud or a lakehouse, with all the transformations running afterwards, a pattern known as the ELT paradigm. These transformations include curating, selecting, combining, aggregating, and applying custom logic in various ways. Essentially, the transformations encapsulate the logic needed to power the business apps, and they are typically managed by data analysts or analytics engineers, who act as the connector between the data and business worlds, as illustrated below.




Figure 1: The data analyst or the analytics engineer connects the data and business worlds.


The above transformations could also be staged, e.g., the medallion design pattern in the Databricks Lakehouse, or organized into hierarchical pipelines, e.g., the Asimov data pipelines in Microsoft’s Cosmos data analytics. Data transformation and modeling are also crucial for machine learning and data science applications. For instance, Peregrine, a platform for optimizing data platforms using ML, relies on transforming various sets of system logs and metrics into an intermediate representation. Even OpenAI calls for elaborate data preparation and high data quality when using a prompt dataset for fine-tuning. Clearly, there is no substitute for good data modeling.


Several tools exist for data modeling in the modern data stack. We illustrate a small subset of them in the figure below.




Figure 2: A subset of tools for data modeling in the modern data stack.


Given that modern data processing systems have column stores with vectorized query processing for interactive performance, data analysts can directly run transformation queries whenever the business app needs them. However, for better performance and predictability, analysts typically save the transformation queries as materialized views (e.g., in relational databases such as PostgreSQL), tasks (e.g., in Snowflake), or pipelines (e.g., in Databricks) in the different data processing systems. Alternatively, they could run the transformations as repeatable scripts external to the data processing systems using tools such as DBT, Airflow, Astronomer, and others.
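As a hedged illustration of why analysts materialize transformations (SQLite in-memory with hypothetical table names; SQLite has no materialized views, so a plain `CREATE TABLE AS` stands in for one), the transformation result is precomputed once so the business app reads a small summary table instead of re-aggregating raw events on every dashboard load:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Raw events landed by the extract-and-load step of ELT.
cur.execute("CREATE TABLE events (user_id INTEGER, revenue REAL)")
cur.executemany("INSERT INTO events VALUES (?, ?)",
                [(1, 5.0), (1, 7.0), (2, 3.0)])

# The transformation step, materialized: the aggregation runs once here,
# not on every dashboard query.
cur.execute("""
    CREATE TABLE revenue_by_user AS
    SELECT user_id, SUM(revenue) AS total
    FROM events
    GROUP BY user_id
""")

# The business app now queries the precomputed table directly.
rows = cur.execute(
    "SELECT user_id, total FROM revenue_by_user ORDER BY user_id"
).fetchall()
print(rows)  # [(1, 12.0), (2, 3.0)]
```

In a real system such as PostgreSQL this would be a `CREATE MATERIALIZED VIEW` refreshed on a schedule, with the same trade-off: predictable read latency in exchange for managing refreshes.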


Apart from creating data models on the data processing platform side, analysts could also create data models on the application side. For example, they could use Looker’s LookML to define models and create persistent derived tables (PDTs). Or they could define saved queries in Superset, extracts in Tableau, imports in Power BI, preferred tables in BigQuery, and SPICE tables in Amazon QuickSight. Each of these application-side mechanisms allows analysts to build application-specific data models.


Towards Learned Data Models

Current approaches to data modeling in the modern data stack are challenging on several counts. First, the above data modeling tools are all manual, requiring the data analyst to spend a painful amount of time and effort handcrafting the right data models. As a result, it takes days or weeks before data analysts can surface the data for business users to get insights. Moreover, there is an entire lifecycle of events, including deploying, updating, sharing, and optimizing data models, that is extremely tedious for data analysts to manage. For example, updating the data models requires the analyst to be aware of the data arrival rates, the application requirements, and all the cascading dependencies on any given data model. The question, therefore, is whether we can do better. Can AI help here?


It is worth noting that even though the space of possible data models is very large, the actual data models built by analysts are not arbitrary. Instead, they encode meaningful business logic with patterns that often overlap with, and resemble, other logic their team (or even other teams) has written. For example, teams on the Cosmos analytics platform at Microsoft routinely have more than half of their analytics jobs overlapping with each other. These are natural patterns to learn in order to assist the data analyst. Additionally, many application frameworks generate canned SQL statements, e.g., symmetric aggregates in Looker ensure there is no SQL fanout. Such logic is based on well-defined rules that are useful to learn. Finally, users know a lot about their applications, and their interactions could provide insights into which data models make sense. Overall, we see an interesting case for learning data models based on data patterns, rules, and user interactions.
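To make the overlap intuition concrete, here is a rough sketch (pure Python with made-up example queries; real systems fingerprint query plans, not raw SQL text) of grouping transformation queries by a normalized fingerprint, so that overlapping logic across a team’s jobs can be surfaced:

```python
import re

def fingerprint(sql: str) -> str:
    """Crude query fingerprint: lowercase, mask literals, collapse whitespace."""
    s = sql.lower()
    s = re.sub(r"'[^']*'", "?", s)           # mask string literals
    s = re.sub(r"\b\d+(\.\d+)?\b", "?", s)   # mask numeric literals
    s = re.sub(r"\s+", " ", s).strip()       # collapse whitespace
    return s

# Hypothetical analytics jobs from one team.
jobs = [
    "SELECT region, SUM(sales) FROM orders WHERE year = 2022 GROUP BY region",
    "select region, sum(sales)  from orders where year = 2023 group by region",
    "SELECT product, AVG(price) FROM catalog GROUP BY product",
]

# Group jobs whose normalized logic is identical.
groups = {}
for q in jobs:
    groups.setdefault(fingerprint(q), []).append(q)

# The first two jobs collapse to one fingerprint: overlapping logic that a
# learned data model could factor out and materialize once.
overlapping = [qs for qs in groups.values() if len(qs) > 1]
print(len(groups), len(overlapping))  # 2 1
```

Even this toy normalization finds that two of the three jobs differ only in a literal, exactly the kind of recurring pattern worth learning.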


Learned data models raise many obvious questions. First, what makes one data model better than another? What is the training objective? Some candidate properties for evaluating data models could be:

  1. Quality, i.e., how clean or complete the data in a model is.
  2. Correctness, i.e., whether a model follows well-defined rules and heuristics.
  3. Interestingness, i.e., whether the model stands out on statistically defined metrics or as an interesting data pattern.

Understanding more properties that differentiate data models remains an open question.
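One way to make these candidate properties operational, purely as an illustrative sketch with made-up metrics and weights (none of which come from the post), is to score each axis in [0, 1] and combine them into a single objective:

```python
def score_data_model(null_fraction: float,
                     rules_passed: int, rules_total: int,
                     z_score: float,
                     weights=(0.4, 0.4, 0.2)) -> float:
    """Toy objective combining the three candidate properties.

    quality:         completeness, here simply 1 - fraction of null cells
    correctness:     share of well-defined rules/heuristics the model passes
    interestingness: how far a summary statistic deviates from expectation,
                     squashed into [0, 1]
    All metrics and weights are illustrative assumptions.
    """
    quality = 1.0 - null_fraction
    correctness = rules_passed / rules_total if rules_total else 0.0
    interestingness = min(abs(z_score) / 3.0, 1.0)
    w_q, w_c, w_i = weights
    return w_q * quality + w_c * correctness + w_i * interestingness

# A fairly clean model that passes most rules and is mildly surprising:
print(round(score_data_model(0.05, 9, 10, 1.5), 3))  # 0.84
```

A real training objective would need learned, data-dependent versions of each term, but even a fixed scoring function like this makes the ranking question concrete.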


An even bolder question is whether large data models could be pre-trained, i.e., whether data models could be learned independently of the data and application platforms. Indeed, current trends show an increasing interest in running SQL queries over anything using tools like Starburst, and in optimizing those queries in a platform-agnostic manner using an optimizer as a service. Likewise, tools like DBT encourage creating data models independently before using them in the application platform. Both these trends indicate the possibility of learning large data models!


Data Models vs Language Models

Finally, let’s step back and see how data models are similar to, or different from, language models. Large language models have made answers to natural language questions accessible within minutes, instead of searching through scores of links and webpages. Large data models could have a similar effect of delivering insights quickly (within minutes) instead of digging through a jungle of tables and handcrafting the right data models to surface those insights. This opens up a deluge of interesting questions: Is a large data model possible? Aren’t large language models all we need? Given that several recent approaches map natural language to SQL, making people better at asking questions, how do we get better at answering them? How do we ensure correctness? Can we fine-tune to a specific business context?


AI is defining a new world structure, and the role of data is evolving along with it. While data continues to power pretty much everything around us, it is no longer something operators want to figure out or fiddle with at the frontend. Instead, the underlying hope is for automation that makes data show up magically, whenever and however it is needed. Is that future possible?
