This post is the first in a series where I talk about building an Open Source AI Data Analyst using LLMs. If you're familiar with Python, you can follow along with my experiments in my GitHub repo.
A while ago, I began wondering whether it would be possible to build an AI data analyst capable of understanding all of your company's data and responding to questions about the business.
The idea isn't a new one. If you've worked in data for any length of time, you've probably been asked to build a self-serve analytics platform of some sort. So far, though, when we think of self-serve analytics, we imagine a platform where all your metrics and dimensions are clearly defined - one where business users can analyse and plot data easily through a point-and-click interface.
Now we'll just take the same idea one step further - why bother with a point-and-click interface at all? After all, creating dashboards is not the end goal of having a data team - getting insights is.
So the motivation is simple:
What if you could simply ask an AI agent questions about your business, and have it respond with the insights you need to make decisions?
The hypothesis
So what do we need for this idea to work?
You should be able to ask questions about your business in natural language, including the vernacular used in your organisation.
The AI agent should be able to search through your warehouse for the data sources or models that contain the information needed to answer the question.
The agent should be able to read those sources and respond with the SQL queries, charts or dashboards that give you the insights you need.
You (or an observer) should be able to tell the agent whether the answer is correct, complete and satisfactory. If it isn't, you should be able to offer corrections that the agent learns from for subsequent questions on the same topic.
If you can get all of these pieces to work together, you’ve got an AI agent that can perform the essential duties of a junior data analyst.
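To make that concrete, here is a minimal sketch of the interface these pieces imply. Everything below is a placeholder of my own invention - not code from the repo, just the shape of the thing:

```python
from dataclasses import dataclass

# A minimal sketch only - all names (DataAnalystAgent, Answer, ask,
# correct) are placeholders I've invented for illustration.

@dataclass
class Answer:
    sql: str          # the query the agent proposes to run
    explanation: str  # a plain-language summary of the insight

class DataAnalystAgent:
    def ask(self, question: str) -> Answer:
        """Search warehouse metadata for relevant tables, then generate
        a SQL query that answers the natural-language question."""
        raise NotImplementedError

    def correct(self, question: str, feedback: str) -> None:
        """Record a correction so that subsequent answers about the
        same topic improve."""
        raise NotImplementedError
```

The hard part, of course, is everything hiding behind those two docstrings.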
The challenge
The first step towards solving this problem is building an AI agent that is able to learn about and understand your organisation's data. However, there are as many ways of building data platforms and warehouses as there are companies in the world.
Let us examine how this poses a challenge.
Problem #1 - No global standards
It is difficult to reliably use the names of tables and columns to interpret the data contained within.
There are no globally agreed-upon standards in the universe of data engineering, so most data engineers come up with their own ways of naming tables, columns, schemas and so on. Where conventions do exist, they usually operate at the level of a single organisation, and there is very little tooling available to enforce them automatically - which means you can't be sure how well people adhere to them.
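As a made-up illustration (every name below is invented), the same business concept can surface under completely different names in different warehouses:

```python
# Invented examples: the same concept - gross revenue per order - named
# three different ways. Nothing in the names alone tells an agent that
# these columns mean the same thing, or whether they actually do.
revenue_columns = {
    "warehouse_a": "fct_orders.gross_revenue_usd",
    "warehouse_b": "sales.rev",     # abbreviated; currency unknown
    "warehouse_c": "tbl_ord2.amt",  # total? pre-tax? per line item?
}
```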
Problem #2 - No established roles
Clean and dirty data often live side by side in the same place, and it can be hard to determine the role of a given table.
While there are well-established design patterns for building a data warehouse - e.g. facts and dimensions as described by the star schema - they aren't always followed. The recent proliferation of tech companies with inexperienced data teams means that not everyone knows about these patterns, and even when you do, pressure to deliver results quickly can understandably force you to take shortcuts and compromise on the architecture of the warehouse.
Additionally, between the myriad ETL technologies on the market and legacy data being carried forward, it is often very difficult to distinguish a table that contains the correct version of a metric from one that's an intermediate step of an ETL pipeline, or from one that holds a legacy definition that should no longer be used.
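For instance, a listing like this (all table names invented) gives you no reliable way to tell which table is the source of truth:

```python
# Invented listing: which of these holds the canonical orders data?
# The names alone can't tell you.
tables_in_schema = [
    "orders",            # raw ingest, or already cleaned?
    "stg_orders",        # an intermediate step of some ETL
    "orders_v2",         # a migration - finished or abandoned?
    "orders_final",      # "final" as of when?
    "orders_backup_old", # legacy data carried forward
]
```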
Problem #3 - Metric definitions may vary
The same metric names can mean different things at different organisations.
This one is self-explanatory. When an organisation says "Profit", it could mean operating profit or net profit. Similarly, "Cost" can mean landed cost, cost of goods sold, and so on.
On top of this, there is often implicit logic baked into the definition of a metric that varies from organisation to organisation - e.g. CompanyABC does not include plans sold under the pricing plan titled "winter_seasonal_promotions" when calculating MRR.
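As a minimal sketch of what that looks like in practice - the subscriptions table and its columns are assumptions on my part; only the excluded plan name comes from the example above - the rule ends up as a single, easy-to-miss filter in the SQL:

```python
# A sketch of CompanyABC's implicit MRR logic. The table and column
# names (subscriptions, status, monthly_amount, plan_name) are assumed
# for illustration; only the excluded plan name is from the post.
MRR_QUERY = """
SELECT SUM(monthly_amount) AS mrr
FROM subscriptions
WHERE status = 'active'
  AND plan_name <> 'winter_seasonal_promotions'  -- the implicit rule
"""
```

Unless that rule is written down somewhere, an agent (or a new analyst) computing MRR from the raw tables will silently produce a different number than the business expects.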
Problem #4 - Lack of documentation
Not all organisations have sufficient explicit documentation covering their data.
Writing documentation is hard. It takes a lot of time and effort to document every part of your data pipeline and the decisions taken along the way, and even once it's written, the documentation needs updating every time you push changes to your code.
As a result, I have never worked at an organisation that had accurate, exhaustive and up-to-date documentation for every part of its data platform.
Conclusion
Together, these problems make it really hard to build a generalised system capable of understanding any organisation's data in the same way a person would.
However, recent advances in LLMs are quite amazing. In the next part of this series, I'll talk about how I propose to shrink the problem to make it easier to solve, and about my experiments in creating an MVP.
It's a good problem to solve. Best wishes,
Sanjiv