Generative AI for Enterprise Data

Published: August 14, 2023

Alekh Jindal

Generative AI has captured people's imagination for a variety of tasks, from coding to creative arts, that are otherwise tedious and hard. Enterprise data is likewise tedious and hard for many people, so the natural question is whether generative AI can help. Data analytics, in particular, is a natural next task for generative AI, with many new approaches emerging. We discuss these below.

Current Approaches

Current generative AI approaches for data analytics fall into three main categories, namely (1) text-to-SQL, (2) contextual, and (3) finetuning:

Text-to-SQL

Generating a SQL query from natural language is a popular technique for data analytics. The idea dates back to at least 1978, when William A. Martin presented a "Natural Language Database Query System", called EQS, at MIT. Recent advances, however, have spurred a wave of natural language interfaces for databases. These include numerous startups, e.g., text2sql.ai, seek.ai, defog.ai, nlsql.com, blazesql.com, and large vendors, e.g., ThoughtSpot, Sage, Microsoft Synapse Fabric, Amazon QuickSight, and Databricks English SDK.
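The core of the text-to-SQL pattern can be sketched in a few lines: embed the database schema in the prompt and ask the model to emit a query. The schema, the prompt wording, and the `call_llm` callback below are illustrative assumptions, not any particular vendor's API; real systems add validation, retries, and execution sandboxing on top.

```python
# Minimal sketch of the text-to-SQL pattern: the schema goes into the
# prompt, and the model is asked to return a single SQL query.
# `call_llm` is a hypothetical stand-in for any LLM completion API.

SCHEMA = """
CREATE TABLE orders (id INT, customer_id INT, amount DECIMAL, created DATE);
CREATE TABLE customers (id INT, name TEXT, region TEXT);
"""

def build_prompt(question: str) -> str:
    # Note: the whole schema must fit in the context window, which is
    # exactly the limitation discussed under Challenge 1 below.
    return (
        "You are a SQL assistant. Given this schema:\n"
        f"{SCHEMA}\n"
        "Write a single SQL query that answers the question below. "
        "Return only SQL.\n"
        f"Question: {question}"
    )

def text_to_sql(question: str, call_llm) -> str:
    # The returned SQL still has to be validated and executed by the
    # caller; the model output is not trusted as-is.
    return call_llm(build_prompt(question)).strip()
```

Because the model output is unverified text, the burden of spotting a wrong join or filter falls on the user, which is why SQL expertise remains a prerequisite for this approach.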

Contextual

The other approach is to provide relevant data as context when sending prompts to the language models. OpenAI Code Interpreter, for instance, can take an entire data file as context. For larger data, retrieval augmented generation (RAG) can fetch relevant portions from a vector database, e.g., via frameworks such as LlamaIndex. We can further improve prompting using few-shot, iterative, chain-of-thought, or tree-of-thought approaches to split the problem into smaller pieces.
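The retrieval step of RAG can be sketched as follows: score stored text chunks against the question and prepend the best matches to the prompt. Production systems use learned embeddings and a dedicated vector database; the plain bag-of-words overlap here is an assumption made only to keep the example self-contained.

```python
# Toy sketch of RAG retrieval: rank chunks by cosine similarity to the
# question, then build a prompt from the top matches. Bag-of-words
# counts stand in for real learned embeddings.
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Stand-in "embedding": term frequencies of lowercase tokens.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    # Return the k chunks most similar to the question.
    q = embed(question)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]

def build_rag_prompt(question: str, chunks: list[str]) -> str:
    # Only the retrieved portions, not the whole corpus, enter the prompt.
    context = "\n".join(retrieve(question, chunks))
    return f"Context:\n{context}\n\nQuestion: {question}"
```

Note that whatever is retrieved leaves the enterprise boundary inside the prompt, which is the data-leak concern raised under Challenge 3 below.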

Finetuning

The final approach is to finetune the language model on the given enterprise data. This requires a massive amount of resources and in-house expertise. Examples include Dolly and BloombergGPT, which were finetuned on data collected within their respective organizations.

Each of the above approaches has its own set of challenges. Let's discuss the typical ones below.

Challenge 1: Big Data

Enterprise data can quickly grow big, proverbially referred to as big data. This does not work well for the text-to-SQL approach, which runs queries directly over the raw data. Text-to-SQL is also limited to schemas small enough to fit within context size limits, and getting complex schemas right requires sophisticated prompting that could end up describing every single operator. The contextual approach is likewise limited by context size limits, e.g., a few hundred megabytes in Code Interpreter or 32K tokens in GPT-4.

This is reminiscent of the quote "Nobody will ever need more than 640K RAM" that is famously attributed to Bill Gates.

Moreover, recent results show that where information appears within a long context affects model accuracy. Finally, note that big data makes finetuning even more resource-intensive: training GPT-scale models is estimated to cost between 4 and 100 million dollars, and BloombergGPT required nearly 1.3 million GPU hours!

Challenge 2: Hallucination

Data analytics requires accurate answers; however, recent studies from OpenAI show GPT model accuracy ranging from 50–80%. In fact, OpenAI recommends:
"Great care should be taken when using language model outputs, particularly in high-stakes contexts, with the exact protocol (such as human review, grounding with additional context, or avoiding high-stakes uses altogether) matching the needs of specific applications." — GPT-4 Technical Report.

No wonder text-to-SQL approaches have accuracy ranging between 50–85% on popular leaderboards like Spider and BIRD. More importantly, users need to be experts in SQL in order to spot and fix errors. The contextual and finetuning approaches reduce hallucination, but they need a lot of training data or context, and they still do not eliminate it entirely.

Challenge 3: Data Leak

Data leak is a major concern for enterprises. In fact, an increasing number of companies, including Amazon, Apple, Samsung, Verizon, Bank of America, Goldman Sachs, and others, have banned ChatGPT. Likewise, many countries have blocked access to ChatGPT due to privacy concerns.
"We ask that you diligently adhere to our security guideline and failure to do so may result in a breach or compromise of company information resulting in disciplinary action up to and including termination of employment," — Samsung memo, May 2, 2023.

While text-to-SQL needs to share complete database schemas, retrieval augmented generation (RAG) pulls data out of enterprise data sources and sends it to the language models. Using a vector database also requires duplicating data as embeddings in the vector store.

Challenge 4: Interactivity

Modern analytics demands interactive query performance, typically response times within 2 seconds. Unfortunately, text-to-SQL approaches generate queries that run directly over the raw data and can end up slow. This is contrary to most analytics platforms, which provide mechanisms to transform and aggregate data before querying it interactively, e.g., Imports in Power BI, Extracts in Tableau, PDTs in Looker, Preferred Tables in BigQuery BI Engine, Saved Queries in Superset, and SPICE in Amazon QuickSight.
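The pre-aggregation pattern behind these extract and import mechanisms can be sketched simply: roll the raw rows up once into a small summary table, then answer interactive queries from the summary instead of rescanning the raw data. The `(region, amount)` shape of the rows is an illustrative assumption.

```python
# Sketch of pre-aggregation: one expensive pass builds a compact
# summary, after which each query is a cheap lookup.
from collections import defaultdict

def build_extract(raw_rows):
    # raw_rows: iterable of (region, amount) pairs. This is the single
    # scan over the (potentially huge) raw data.
    summary = defaultdict(float)
    for region, amount in raw_rows:
        summary[region] += amount
    return dict(summary)

def query_extract(summary, region):
    # Interactive lookup: O(1) against the summary, independent of how
    # many raw rows were aggregated.
    return summary.get(region, 0.0)
```

A text-to-SQL query that scans the raw rows on every question cannot match this, which is why the extract-style designs listed above dominate interactive analytics.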

Contextual approaches, on the other hand, rely on large context sizes. Unfortunately, language model response times grow with context length, e.g., typical latencies of more than 10 seconds for GPT-4.

The Ideal Scenario

Ideally, analytics on enterprise data must scale to large data sizes, produce accurate answers, minimize data leaks, and remain highly interactive. Is this ideal picture possible?

The Good ol' Data Models

Tursio is taking a brand-new approach to generative AI on enterprise data. We build on the well-known concept of data models and apply a generative approach to it. Specifically, we introduce a Large Data Model (LDM) that pre-generates data models and retrieves the relevant ones in response to user queries. Since queries are executed against data models, without any embedding of the physical data, they can scale to arbitrary data sizes. All answers are rooted in data models and hence guaranteed to be correct. Furthermore, data stays within the database at all times, thus providing better privacy and lower leakage risk. Finally, Tursio manages all data models with intelligent caching and refresh, making them highly optimized for interactive performance.

Together, the LDM and the LLM form the left and right brains, respectively, combining factfulness and creativity for modern intelligence.

Applications


Application 1: Automated Q&A

Tursio makes data accessible to everyone within an organization, without sacrificing accuracy, privacy, interactivity, and scale. Users can start asking natural language questions, and the system generates the corresponding data models that are shared with other users.

Application 2: Automated Dashboards

Tursio generates visualizations for every data model, and users can pin those visualizations onto a dashboard. Thereafter, Tursio also takes care of managing the data models, i.e., generating data pipelines and refreshing them, thus making dashboarding a no-code affair.

Application 3: Automated Monitoring

Data models change over time, and it is hard for users to keep track of the growing number of data models. Tursio helps monitor data models for any unusual behavior and surfaces them if they require attention.

Application 4: Automated Reports

The end goal of analytics is to summarize, in a report, the actions that need to be taken. Tursio can generate reports over one or more data models, describing their behavior, trends, and insights.

Summary

A Large Data Model can solve many of the challenges in applying generative AI on enterprise data. It simplifies analytics and makes it accessible to a wider audience. But more importantly, it learns to automate tedious manual tasks that would otherwise consume valuable human time.

To learn more about our approach, visit https://www.tursio.ai or write to us at contact@tursio.ai
