
The AI Ambition and the Ground Reality

Published: March 17, 2023

Alekh Jindal


This blog attempts to connect the dots between the present AI excitement and what it could mean for the future of data, while reconciling both with some of the lessons learned from the recent past.


AI-powered World

We are living in one of the most exciting times for AI, with ChatGPT being the Sputnik moment that sent the whole world scrambling for action. From creativity and productivity to efficiency and growth to the life and material sciences, there is an endless list of what is possible with AI today, and the world is changing faster than anyone can keep up with. Many believe generative AI is akin to the dawn of the internet or the rise of mobile, each of which unleashed a new wave of applications that destroyed existing industries and created new ones.


No wonder there is an unparalleled ambition to reimagine pretty much every part of our world, with generative AI applications in text, image, audio, video, gaming, code, chat, apps, and legal, to name a few. A recent count shows 210+ generative AI startups out there, with 40+ generative AI companies backed by the Y Combinator 2023 batch alone. And you know a new gold rush is driving this crazy landscape when a year-old startup announces raising $350 million even though its product is yet to launch. Still, there are more fundamental questions as well. For instance, what will the future tech stack look like, will AI manipulate humans, and will this be the end for Google? (This article, for the record, is entirely human-written, with Google still the primary research tool.)


While a new future of applications unfolds with AI, what does AI mean for data? What about the enterprise data that largely remains unused, or even un-accessed, today? Is data getting better with AI?


AI-powered Data Analytics

There is a growing interest in leveraging AI for data analytics. This is driven by the need to make analytics self-serve and democratized for all units of a business that are looking to turn more and more data into intelligence. In fact, experts believe more than 500 million intelligent applications are going to be built over the next few years, raising serious questions about the scalability and cost of the underlying data platforms. Furthermore, data analytics is getting more complex, from requirements to actions, thereby demanding more user expertise. Yet, a quick search on LinkedIn reveals more than 7M "analysts" compared to only 200K "data engineers", indicating far more non-expert users asking business questions than experts who can work on the data stack.


Recent approaches are trying to make analytics more accessible via conversational interfaces. These include both existing players, such as ThoughtSpot SearchIQ, Salesforce Tableau, Amazon QuickSight, and Microsoft Power BI, as well as new startups such as Seek, Defog, Ai2sql, Nlsql, Outerbase, ChatSpot, etc. The challenge, however, is to ensure correctness, i.e., conversations must not lead to wrong results. Correctness is also a broader concern with generative AI (see attempts at accurate chatbots), and therefore, current conversational interfaces for data analytics are mostly suggestive and best-effort.
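One cheap guard that such interfaces can apply is to validate model-generated SQL against the target database before ever running it. The sketch below (my own illustrative example, not any vendor's actual pipeline) uses SQLite's `EXPLAIN` to compile a candidate query without executing it, which catches syntax errors and references to nonexistent tables or columns, though not semantically wrong-but-valid SQL:

```python
import sqlite3

def validate_generated_sql(conn: sqlite3.Connection, sql: str) -> bool:
    """Reject a model-generated query unless the engine can plan it.

    EXPLAIN asks the database to compile the statement without running
    it, so invalid names and syntax fail fast and cheaply.
    """
    try:
        conn.execute("EXPLAIN " + sql)
        return True
    except sqlite3.Error:
        return False

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")

# A plausible model output vs. one with a hallucinated column name.
print(validate_generated_sql(conn, "SELECT region, SUM(amount) FROM sales GROUP BY region"))  # True
print(validate_generated_sql(conn, "SELECT regino, SUM(amount) FROM sales GROUP BY regino"))  # False
```

This only rules out queries the engine cannot compile; ensuring the query actually answers the user's question remains the harder, open part of the correctness problem.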


Apart from query interfaces, other efforts for AI-powered data analytics include Trifacta and DataRobot/Paxata for traditional data preparation, Lume for schema mapping, Turntable for assisted data modeling, and Keebo for warehouse optimization. Many of these are early-stage efforts, and many other problems, including data discovery, integration, models, pipelines, privacy, quality, operations, etc., remain a rich opportunity for AI to seep deeper into the data analytics stack and solve some of the harder problems there. Overall, we are still in the early days of making AI-powered data analytics practical and end-to-end.


Déjà vu: AI-powered Data Systems

Interestingly, a very similar story of infusing AI has been playing out for data systems over the last few years. Consider data platforms like Spark, Snowflake, or BigQuery, and the various system-level problems inside them, such as performance, scale, resource utilization, efficiency, cost, configurations, etc. Many of these problems have become incredibly complex in the cloud, with more sophisticated systems and workloads, less-expert users (thanks to the ease of getting started), a lack of control in managed services, and too many moving parts across multiple layers of abstraction. As a result, it is hard for customers, who no longer have the DBAs of the older on-premises world, and hard for cloud service providers, who are grappling with a deluge of customer requests while trying to meet their quality of service.


Together with my colleagues in Azure Data, I have previously spent several years at Microsoft building and deploying a gamut of AI-powered techniques for data systems. Let’s briefly look at three of them below.


1. Learned Cardinality Estimation

Data systems typically use a query optimizer to convert declarative SQL queries (represented as a tree of operators) to physical query execution plans (an optimized tree of operators). And the core of a query optimizer requires estimating cardinalities (i.e., the row count) at each point in the operator tree. Cardinalities help estimate the cost of different physical operator trees and thus pick the cheapest one for execution. Unfortunately, cardinality estimation has been a long-standing problem in databases:
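To see why cardinality estimation is so hard, consider how classical optimizers combine predicates: they multiply per-predicate selectivities, assuming independence. The toy example below (illustrative data of my own) shows how correlated columns break that assumption; the same multiplicative error then compounds up the operator tree:

```python
# Classical estimate: multiply per-predicate selectivities, assuming
# the columns are independent. Correlated columns break this, and the
# error compounds at every level of the operator tree.

rows = [
    {"city": "Seattle", "state": "WA"},
    {"city": "Seattle", "state": "WA"},
    {"city": "Portland", "state": "OR"},
    {"city": "Spokane", "state": "WA"},
]

sel_city = sum(r["city"] == "Seattle" for r in rows) / len(rows)   # 0.5
sel_state = sum(r["state"] == "WA" for r in rows) / len(rows)      # 0.75

# Independence assumption: 4 * 0.5 * 0.75 = 1.5 rows estimated.
estimated = len(rows) * sel_city * sel_state

# True cardinality: city = Seattle already implies state = WA -> 2 rows.
actual = sum(r["city"] == "Seattle" and r["state"] == "WA" for r in rows)

print(estimated, actual)  # 1.5 vs 2
```

With only two correlated predicates the error is small; over a deep tree of joins and filters, such errors multiply into the orders-of-magnitude misestimates described below.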


"The root of all evil, the Achilles Heel of query optimization, is the estimation of the size of intermediate results, known as cardinalities" — Guy Lohman.

At Microsoft, cardinality estimates in the Cosmos big data workload range from 10,000× under-estimation all the way to a million times over-estimation for different operator sub-trees, exacerbated by the lack of statistics at massive scale, the presence of large volumes of unstructured data, and the generous use of user-defined operators that are hard for the optimizer to reason about.


We exploited two observations for building CardLearner, the ML-based cardinality estimator: (i) the recurring nature of the workloads where similar jobs with different inputs and parameters were executed repeatedly, and (ii) a large number of jobs having similar sub-trees across them. Together, these provided an excellent training input for learning several small and highly accurate micromodels. These micromodels are then served to the query optimizer at compile time via an insights service.
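The micromodel idea can be sketched in a few lines. The code below is my own simplified illustration, not the CardLearner implementation: it keeps one tiny model per sub-tree "signature" (here, just the average observed selectivity across past runs) and falls back to the default estimator for unseen sub-trees:

```python
from collections import defaultdict

class MicroModels:
    """One tiny per-sub-tree model: average selectivity over past runs."""

    def __init__(self):
        self._obs = defaultdict(list)  # signature -> [(input_rows, output_rows)]

    def observe(self, signature: str, input_rows: int, output_rows: int):
        self._obs[signature].append((input_rows, output_rows))

    def estimate(self, signature: str, input_rows: int):
        history = self._obs.get(signature)
        if not history:
            return None  # unseen sub-tree: fall back to the default estimator
        avg_sel = sum(out / inp for inp, out in history) / len(history)
        return avg_sel * input_rows

models = MicroModels()
# Three past runs of the same filter sub-tree on different input sizes.
models.observe("scan>filter(status=active)", 1_000, 120)
models.observe("scan>filter(status=active)", 2_000, 260)
models.observe("scan>filter(status=active)", 4_000, 500)

print(models.estimate("scan>filter(status=active)", 10_000))  # 1250.0
```

The recurring workload is what makes this work: each signature accumulates enough observations to make even a trivially simple local model highly accurate for that sub-tree.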


Validation over production workloads showed the 95th-percentile error reducing by five orders of magnitude, from 465,711% to just 1%, leading to better cost estimates, lower job latencies, and, surprisingly, even a lower number of containers used by the workload. CardLearner is the first ML-based cardinality estimator to be deployed in production, and the key to its success was 4+ years of hardening, from an intern project to being enabled by default for critical production workloads, handling numerous corner cases, performance regressions, and fallback mechanisms along the way.


2. Learned Resource Allocation

Cloud data systems are increasingly "serverless", where users do not have to provision resources upfront and the system can decide resources dynamically. However, most systems still allow users to provide hints when submitting queries, e.g., concurrency level in Snowflake, executor count in Spark, and token counts in SCOPE at Microsoft. Unfortunately, SCOPE users rarely make an informed decision:


"At no point did I feel I had a better understanding beyond this: more tokens mean faster job completion… [U]se minimal tokens (<50) for tiny jobs and as many tokens as possible otherwise" — SCOPE user.

No wonder 40–60% of the jobs in Cosmos over-allocate tokens by as much as 1000×, blocking resources for other jobs in the shared cluster while also creating an artificial peak demand higher than actually needed (see the production resource distributions here). An ML-based model, AutoToken, that predicts the peak resource for each job has an RMSE of less than 10% (two orders of magnitude lower than the previous state of the art), and it reduced the token ask in one customer workload by 97%.
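The core intuition is simple enough to sketch. The code below is my own toy illustration in the spirit of AutoToken, not its actual model: for a recurring job template, predict the next run's peak token demand from past observed peaks, with a small safety margin, instead of trusting the user's hint:

```python
def predict_peak_tokens(past_peaks, margin=1.1):
    """Conservative estimate for a recurring job: max observed peak
    plus a safety margin (the real model learns from job features)."""
    if not past_peaks:
        return None  # new template: let the user/system default apply
    return int(max(past_peaks) * margin)

history = [38, 42, 40, 41]   # peak tokens actually used by past runs
requested_by_user = 1000     # a typical over-allocated hint
predicted = predict_peak_tokens(history)
print(predicted, f"saves {100 * (1 - predicted / requested_by_user):.0f}% of the ask")
```

Even this crude rule captures why recurring workloads are such fertile ground: history for the same job template is a far better signal than a user's guess.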


Interestingly, resources below the peak may not hurt performance proportionately: there is a sweet spot where reduced resources (from the peak needed) still yield acceptable performance. Models for such optimal resource allocation can be built using careful experimentation, and the approach is applicable to other data systems, e.g., optimal executor counts in Spark. However, AI alone cannot substitute for the need to model the problem carefully, discover the relationships between performance and cost, formulate a generalized model, and test and improve it repeatedly. This was the key to deploying learned resource allocation.
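The sweet-spot search itself is a small optimization once the performance-cost curve has been measured. The sketch below (toy measurements, my own illustration of the idea rather than the production method) picks the smallest allocation whose runtime stays within a tolerated slowdown of the best observed runtime:

```python
def sweet_spot(measurements, max_slowdown=1.10):
    """measurements: list of (tokens, runtime_sec) pairs, any order.
    Return the fewest tokens whose runtime is within max_slowdown
    of the best (peak-allocation) runtime."""
    best_runtime = min(t for _, t in measurements)
    feasible = [(tok, t) for tok, t in measurements
                if t <= best_runtime * max_slowdown]
    return min(feasible)[0]  # fewest tokens among acceptable points

# Measured curve: diminishing returns beyond ~50 tokens.
curve = [(10, 300), (25, 130), (50, 105), (100, 100), (200, 100)]
print(sweet_spot(curve))  # 50 tokens: within 10% of the best runtime
```

The hard part, as the paragraph above notes, is not this final selection but obtaining a trustworthy performance-cost curve through careful experimentation and modeling.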


3. Learned Query Optimizer

The query optimizer has long been a signature component of relational database systems, with a rich history of both academic and industrial research around it. From the early days of System R to modern data systems, this component has evolved significantly, with the Cascades architecture becoming popular across several database implementations, including SQL Server, SCOPE, Spark, Calcite, Greenplum, Snowflake, Spanner, and F1. Interestingly, many individuals who worked on Cascades have moved between companies, cross-pollinating a common set of ideas and design principles across the industry. There is an interesting anecdote of the SCOPE team hiring engineers who had previously worked on a distant-past version of the SCOPE codebase, a few acquisitions, mergers, and reorgs ago. It is no wonder, then, that people working on query optimization are scarce and highly sought after.


Given how fundamental query optimizers are to databases, there was a ChatGPT moment of sorts when people started proposing to replace the entire query optimizer with a learned one. This was an ambitious move that made query optimizer people wonder about the future of this complex area. However, researchers quickly realized that it was hard to completely replace query optimizers with learned ones, so they proposed a more practical version that sits alongside the existing query optimizer and helps guide the query plan search.


My former team at Microsoft worked with the leading researchers on grounding these ideas in the industry-strength workloads of SCOPE. We noticed that SCOPE has 256 optimizer rules, leading to 2²⁵⁶ possible optimizer configurations, rather than the only 48 configurations considered in the original work. Therefore, we needed to divide this massive search space, and we came up with a rule signature that captures the code path a query takes inside the optimizer. We could then learn smaller, specialized models to steer the optimizer toward good paths. Still, performance regression is a big challenge when deploying AI for systems, and the team came up with further innovations to carefully design pre-production experiments before the model could be deployed to production, a multi-year effort in taking the early AI excitement to production reality.
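The rule-signature trick can be sketched concretely. The code below is my own simplified illustration of the idea, not the SCOPE implementation: record which optimizer rules fire while compiling a query, fold that set into a bitmask signature, and keep a small per-signature "model" (here, just a lookup of a known-good rule adjustment) instead of one model over all 2²⁵⁶ configurations:

```python
def rule_signature(fired_rules, total_rules=256):
    """Bitmask over the rule IDs exercised on this query's code path."""
    mask = 0
    for rule_id in fired_rules:
        mask |= 1 << (rule_id % total_rules)
    return mask

# Per-signature "model": which rules to disable to steer toward a
# better plan (a real system would learn this from plan experiments).
steering = {rule_signature([3, 17, 42]): {"disable": [17]}}

query_path = [3, 17, 42]   # rules that fired while compiling a query
hint = steering.get(rule_signature(query_path), {"disable": []})
print(hint)  # {'disable': [17]}
```

Queries that exercise the same code path share a signature, so each specialized model only needs to be accurate for its own narrow slice of the search space, echoing the micromodel lesson from cardinality estimation.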


Lessons Learned

There are three key lessons to be learned from the above examples of deploying AI. First, it is tempting to think of replacing things with AI; yet AI is better off assisting someone or something that already exists in doing a better job. Second, many people instinctively try to learn large global models that can predict everything; however, it is more practical to learn smaller local models that predict fewer things but with far more accuracy. Local models are also simpler, smaller, faster, and even explainable. And finally, the devil is in the details, particularly when it comes to AI. It is non-trivial to make AI work reliably in any production setting, and so it is important to consider the corner cases that can show up.


Conclusion

To conclude, many classic data problems are still not solved, and we continue to hear “data is the biggest blocker” from many practitioners. It will be interesting to see how the new wave of AI unfolds for data, and hopefully, we can draw lessons from some of the recent experiences that we as a community have had in the field.

