As Generative AI and GPT grow more popular and more developed, every industry is asking the same question: how can we use it to get ahead? Data management is no different. With the capacity to create complex code and perform basic data management procedures, some would say these models are already deployable for your data systems.
I still think we have a little ways to go before we can start screaming "AI solves everything!" from the mountaintops. Check out my thoughts on what GPT means for data management below.
Can GPT Improve Data Management?
I believe generative AI is a game-changer that already affects pretty much everyone and will continue to do so. However, in the interest of not getting too far ahead of ourselves, it's important to understand the applications of GPT (and its limitations) at the most fundamental level. Then, we can move on to discussing and realizing its full potential.
When it comes to data management, you can try some simple questions in the basic ChatGPT interface to test its skills. Even with very basic prompts, you can see that the tool shows the capacity to understand data:
Here, it describes a data table:
Here, you can see it suggesting business definitions of DQ rules for a given table:
It can also write a statement in SQL to evaluate defined rules:
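To give a feel for what such a statement can look like, here is a minimal sketch of my own (not actual ChatGPT output). The `customers` table, its rows, and the three rules are all hypothetical, and I'm using Python's built-in `sqlite3` module only so the SQL can be run end to end:

```python
import sqlite3

# Hypothetical table and sample rows, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        (1, "anna@example.com", 34),
        (2, None, 51),                    # breaks "email is mandatory"
        (3, "bob-at-example.com", 28),    # breaks "email contains an @"
        (4, "carl@example.com", 212),     # breaks "age is plausible"
    ],
)

# One SQL statement that counts the rows violating each (assumed) DQ rule.
DQ_SQL = """
SELECT
    SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END) AS missing_email,
    SUM(CASE WHEN email IS NOT NULL AND email NOT LIKE '%_@_%'
             THEN 1 ELSE 0 END) AS malformed_email,
    SUM(CASE WHEN age NOT BETWEEN 0 AND 120 THEN 1 ELSE 0 END) AS implausible_age
FROM customers
"""

def count_violations(connection):
    """Run the rule-checking query and return violation counts per rule."""
    row = connection.execute(DQ_SQL).fetchone()
    return {
        "missing_email": row[0],
        "malformed_email": row[1],
        "implausible_age": row[2],
    }

violations = count_violations(conn)
print(violations)
```

Each rule becomes one `CASE` expression, so adding a rule means adding one more column to the query. Real rules would of course come from your own business definitions.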
As a proof of concept, these prompts work really well. However, you’d still face many issues if you tried to use this basic approach at scale on your data. The first problems you’d encounter are related to the model's capabilities — mainly its limited knowledge, limited context, and tendency to hallucinate answers that sound plausible but are completely made up.
Engineering around the limitations of GPT
While these challenges seem significant, there are reasonable solutions for each. You just have to engineer your way around them. To work around limited knowledge and context, you can use concepts such as semantic search, embeddings, and vector databases to pre-filter relevant facts and pass them to the LLM to use in its answer. You can also teach the AI to use external tools to mitigate its weaknesses. Ultimately, you can combine techniques like these to get much more out of the AI, including having it reference its sources and verify its own work.
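Here is a rough sketch of the pre-filtering idea, under some loud assumptions: the facts are made up, `embed` is a toy bag-of-words stand-in for a real embedding model, and a plain Python list plays the role of the vector database. The point is only the shape of the flow: embed the question, find the closest facts, and put just those into the prompt.

```python
import math
import re
from collections import Counter

# Hypothetical knowledge base; in practice these might be catalog entries or docs.
FACTS = [
    "The customers table stores one row per customer with email and age columns.",
    "The orders table stores order_id, customer_id and total_amount.",
    "Data quality rule DQ-7 requires every customer email to contain an @ sign.",
]

def embed(text):
    """Toy stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

# Stand-in for a vector database: (vector, text) pairs held in a list.
INDEX = [(embed(fact), fact) for fact in FACTS]

def retrieve(question, k=2):
    """Return the k facts most similar to the question."""
    q = embed(question)
    ranked = sorted(INDEX, key=lambda item: cosine(q, item[0]), reverse=True)
    return [text for _, text in ranked[:k]]

def build_prompt(question, k=2):
    """Assemble a prompt that contains only the retrieved facts."""
    context = "\n".join(f"- {fact}" for fact in retrieve(question, k))
    return f"Answer using only these facts:\n{context}\n\nQuestion: {question}"

prompt = build_prompt("Which rule applies to the customer email?")
print(prompt)
```

With a real embedding model the matching would be semantic rather than word-based, but the surrounding plumbing would look much the same.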
OpenAI rolled out a browsing mode and plugins to ChatGPT Plus subscribers in May 2023. Below is a great comparison of how the simple usage of a tool can overcome basic LLM limitations:
In July, OpenAI also released the Code Interpreter. It was a great showcase of overcoming many limitations of a standalone large language model. This wasn't available when we were deciding how to start using large language models, but it is the simplest way to demonstrate some of the principles. Let's take a look at an example of how it can tackle a simple data management problem:
Note that even though the output is long and has many steps, there was no back and forth between me and the model. My only input throughout the whole process was one sentence at the beginning. What we can see is a nice iterative approach that essentially repeats these steps:
1. Ask the LLM what needs to be done.
2. Unless the answer is already obvious, the LLM writes Python code that will help get the desired answer.
3. The Python code gets executed.
4. Return to step 1.
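The loop described above can be sketched in a few lines. To be clear, this is my own simplified illustration, not how the Code Interpreter is actually implemented: `fake_llm` is a hypothetical scripted stand-in for a real model call, and `run_python` executes code with no sandboxing at all, which a real system would absolutely need.

```python
import contextlib
import io

def fake_llm(task, history):
    """Hypothetical stand-in for a real model call; the replies are scripted."""
    if not history:
        # First turn: the answer is not obvious yet, so reply with code to run.
        code = "values = [3, None, 5, None]\nprint(sum(v is None for v in values))"
        return {"type": "code", "content": code}
    # Later turn: the captured output is enough to answer.
    return {
        "type": "answer",
        "content": f"The column has {history[-1].strip()} missing values.",
    }

def run_python(code):
    """Execute code and capture what it prints. No sandboxing in this sketch."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})
    return buffer.getvalue()

def solve(task, llm=fake_llm, max_steps=5):
    """Ask, run the returned code, feed the output back, repeat until answered."""
    history = []
    for _ in range(max_steps):
        reply = llm(task, history)                     # ask what to do next
        if reply["type"] == "answer":                  # answer is obvious: stop
            return reply["content"]
        history.append(run_python(reply["content"]))   # run the code, keep output
    return "No answer within the step limit."

result = solve("How many missing values are in the column?")
print(result)
```

The `max_steps` limit is my addition. Without some cap, a model that never settles on an answer would keep the loop running forever.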
This approach is fairly generic and applicable to many fields. It can be combined with other techniques, like the browsing mentioned above or with a local knowledge base.
Is GPT ready to change the way we work with data?
So, all problems solved? Data management done by AI is here? Not quite yet. Hidden between the lines is the cumbersome engineering effort required to develop and polish these solutions. If you take a closer look at what the Code Interpreter did in the example, it was impressive, especially considering how little effort it took and that the tool was not even primarily developed for these use cases. However, to get valuable outputs in real life, we would need to do quite a bit more work.
It's absolutely amazing that we now have all the building blocks to do complex things automatically, and the end result feels like “magic.” But it’s important to keep in mind that the common message, “AI solves everything,” is a bit too optimistic at this point. AI will help us. It's a great tool. But we are the ones who have to solve the problem, at least for the time being.
Honeycomb published a great post about this: "All the Hard Stuff Nobody Talks About when Building Products with LLMs."
What does all this mean for Ataccama and the data management industry?
We believe LLMs will have a tremendous impact on our business, but it will involve a lot of work. So, where do we start?
We organized an internal hackathon around LLMs to get our top minds thinking about how to use this technology to innovate and change how we work with data. You can read more details here. If you want to stay updated on the latest developments surrounding AI and data management, follow me on Medium. Also, attend our Generative AI event to engage with thought leaders across the industry.