Co-CEO at Fluree, the scalable semantic graph database backed by blockchain technology.
As incredible as large language models (LLMs) are, enterprises can’t yet take full advantage of them. Even the most common use cases (customer service chatbots, marketing writing and code assistance with Copilot) aren’t reliable. In February 2024, Air Canada was forced to pay damages to a customer after its chatbot made up a policy about bereavement fares. Legal hallucinations are an ongoing problem. February 2024 also saw ChatGPT respond to English prompts in Spanglish.
How can LLMs break outside of their three constrained use cases and become more trustworthy and accurate? The answer lies in giving LLMs safe, protected access to data outside of what they’re trained upon. Only when LLMs expand from model-centric AI to data-centric AI will they become reliable enough for broad enterprise use.
Model-Centric AI
Out of the box, an LLM uses only the knowledge on which it was trained. Unless you update the model, it has no current data. Fine-tuning, the process of further training a model on new data, is one way to update it. But unless you retrain the model every day (an expensive, ongoing process), it can’t use real-time information in its answers and will instead rely on outdated information. Combine that with an LLM’s tendency to hallucinate answers, and it becomes clear that LLMs aren’t up to snuff for enterprise use.
Because fine-tuning is expensive and time-consuming, most organizations don’t see the ROI in doing it. There is, however, another way for LLMs to reach beyond their training data to generate credible, reliable answers. The process is known as data-centric AI.
Data-Centric AI
Data-centric AI, also known as runtime AI, enables an LLM to fetch valid, real-time data outside of the model to inform its answers. The LLM uses an API to pose queries to external sources—for example, to an SQL database. After receiving an answer, the LLM incorporates it into its reply.
There are several benefits to this approach. For one, organizations bypass the labor and expense of tuning a model. For another, the LLM saves users from having to write queries in a language such as SQL. Instead, users can ask general questions like, “What’s in our pipeline?” The LLM can then translate the question into a query against the CRM and use the results to compose its reply. Because the model bases its reply on the results of a database query rather than generating an answer from its training data alone, it is less likely to hallucinate.
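The question-to-query-to-reply flow described above can be sketched in a few lines. This is a minimal illustration, not a production design: `question_to_sql` is a hypothetical stand-in for a real model call, and the `deals` table, its columns and the sample rows are invented for the example.

```python
import sqlite3


def question_to_sql(question: str) -> str:
    """Hypothetical stand-in for an LLM that translates a question into SQL.

    A real system would send the question and the database schema to a model.
    """
    if "pipeline" in question.lower():
        return "SELECT stage, SUM(amount) FROM deals GROUP BY stage ORDER BY stage"
    raise ValueError("question not understood")


def build_demo_crm() -> sqlite3.Connection:
    """Create an in-memory database with invented CRM data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE deals (name TEXT, stage TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO deals VALUES (?, ?, ?)",
        [
            ("Acme", "negotiation", 50000),
            ("Globex", "proposal", 20000),
            ("Initech", "negotiation", 30000),
        ],
    )
    conn.commit()
    return conn


def answer(question: str, conn: sqlite3.Connection) -> str:
    """Translate the question, run the query, and fold the rows into a reply."""
    sql = question_to_sql(question)
    rows = conn.execute(sql).fetchall()
    parts = [f"{stage}: ${total:,.0f}" for stage, total in rows]
    return "Current pipeline by stage: " + "; ".join(parts)


if __name__ == "__main__":
    print(answer("What's in our pipeline?", build_demo_crm()))
```

The reply is assembled from rows the database returned, which is the property that makes this approach less prone to invented answers.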
The process is simpler and much more cost-effective than fine-tuning. This is particularly true in light of the commoditization of AI models, which are becoming cheaper to train and run. Whether your use case is pharmaceutical research or a customer-service chatbot, your range of options is expanding as the wave of AI innovation grows. Instead of trying to wire everything on top of a high-end model like ChatGPT, you can pick a smaller, cheaper model for simpler use cases.
Build In Data Safeguards
Having an LLM use an API to retrieve current data may lead to more accurate responses, but it doesn’t ensure that your data remains safe or private. Microsoft Copilot, for example, can access any Microsoft 365 data that is open to employees. The problem compounds when employees use Copilot to help generate new, proprietary code.
You’ll need to build safeguards to protect data as it travels to and through the LLM. For example, if an LLM generates SQL queries, your organization should automatically intercept the model’s output before anything runs. That way your organization, not the LLM, executes the query against the database; the LLM can only propose a query.
By inserting an organizational intermediary to execute a query on behalf of the user, you help ensure the LLM is neither trained with your organizational data nor leaking data into third-party databases.
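One way to picture such an intermediary is a small gate that reviews model-generated SQL before the organization runs it. The sketch below is illustrative only: the function names are hypothetical, the prefix check is a deliberately conservative filter rather than a real SQL parser, and it relies on SQLite's `PRAGMA query_only` as a second line of defense.

```python
import sqlite3


def is_read_only_select(sql: str) -> bool:
    """Conservative check: accept only a single statement that starts with SELECT.

    Assumption: rejecting anything else (including valid read-only forms such
    as WITH ... SELECT) is acceptable for this sketch.
    """
    statement = sql.strip().rstrip(";").strip()
    if ";" in statement:  # more than one statement, or a semicolon in a literal
        return False
    return statement.lower().startswith("select")


def run_reviewed_query(conn: sqlite3.Connection, model_sql: str) -> list:
    """Execute model-generated SQL on the organization's side, read-only."""
    if not is_read_only_select(model_sql):
        raise PermissionError("rejected model-generated SQL")
    # Defense in depth: SQLite refuses writes while query_only is on.
    conn.execute("PRAGMA query_only = ON")
    try:
        return conn.execute(model_sql.strip().rstrip(";")).fetchall()
    finally:
        conn.execute("PRAGMA query_only = OFF")
```

In this arrangement the model never holds a database connection. It only hands over a string, and the organization decides whether and how that string is executed.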
You should also safeguard all data by adding privacy and permission around the data itself. Even the most stringent compliance, such as HIPAA, becomes possible when you can control with whom each user shares data, under which circumstances and for how long.
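As a rough illustration of permissions that travel with the data, the sketch below models a grant that records who may read a record, for what purpose and until when. The `Grant` class, its field names and the `can_read` helper are hypothetical; real compliance regimes such as HIPAA require far more than this.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Iterable


@dataclass(frozen=True)
class Grant:
    """Hypothetical access grant attached to a piece of data."""
    grantee: str          # with whom the data is shared
    purpose: str          # under which circumstances it may be read
    expires_at: datetime  # for how long the grant is valid


def can_read(grants: Iterable[Grant], user: str, purpose: str, now: datetime) -> bool:
    """Allow access only if an unexpired grant matches both user and purpose."""
    return any(
        g.grantee == user and g.purpose == purpose and now < g.expires_at
        for g in grants
    )


if __name__ == "__main__":
    grants = [Grant("dr_lee", "treatment", datetime(2030, 1, 1, tzinfo=timezone.utc))]
    now = datetime.now(timezone.utc)
    print(can_read(grants, "dr_lee", "treatment", now))
    print(can_read(grants, "dr_lee", "marketing", now))
```

A check like this would run before any record is handed to the LLM, so the model only ever sees data the requesting user is entitled to.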
Cryptographically signing and sealing each piece of data in your database also allows users to trust and verify the data. By the time data reaches an LLM, it comes equipped with access control mechanisms and cryptographic verification. An LLM can’t leak or train itself from data that is protected from such possibilities.
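To make the idea of verifiable data concrete, here is a minimal sketch that seals a record and later checks it for tampering. It uses an HMAC from Python's standard library as a stand-in: an HMAC is a shared-key integrity check, not the public-key signature a production system would more likely use, and the record fields and key are invented for the example.

```python
import hashlib
import hmac
import json


def sign_record(record: dict, key: bytes) -> str:
    """Return a hex HMAC-SHA256 tag over a canonical JSON form of the record."""
    # Sorted keys and fixed separators keep the byte encoding stable.
    payload = json.dumps(record, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify_record(record: dict, signature: str, key: bytes) -> bool:
    """Recompute the tag and compare it in constant time."""
    return hmac.compare_digest(sign_record(record, key), signature)


if __name__ == "__main__":
    key = b"demo-key-not-for-production"  # assumption: key management is out of scope
    record = {"customer": "Acme", "balance": 1200}
    tag = sign_record(record, key)
    print(verify_record(record, tag, key))                       # unchanged record
    print(verify_record({**record, "balance": 9999}, tag, key))  # tampered record
```

Any change to the record after sealing produces a different tag, so a consumer of the data, human or LLM pipeline, can detect the modification before relying on it.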
AI Agents And Beyond
Innovators are researching and building three layers of data-centric AI. The layer I just discussed is known as data delivery. It involves helping LLMs retrieve data, such as database query results, to augment their responses, an approach commonly known as retrieval-augmented generation (RAG). RAG is in use now and will likely become common in the coming year.
At a more experimental stage is data-centric data management. It answers the question: How do you make multiple LLMs work together to accomplish something requiring multiple steps, such as booking a flight, hotel and activities for a trip? Each AI model does well when focused on a single task, but models need to be linked and to talk to each other to accomplish a multifaceted workflow. Microsoft AutoGen is an example of a recently released platform that helps developers build such workflows.
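The linking of single-task models can be pictured as a chain of agents that each add one piece to a shared plan. The sketch below is a plain-Python illustration with stub functions in place of real models; it does not use or represent the AutoGen API, and every name in it is hypothetical.

```python
from typing import Callable, Dict, List

TripState = Dict[str, str]
Agent = Callable[[TripState], TripState]


def flight_agent(state: TripState) -> TripState:
    """Stub for a model that only books flights."""
    return {**state, "flight": f"flight to {state['destination']}"}


def hotel_agent(state: TripState) -> TripState:
    """Stub for a model that only books hotels."""
    return {**state, "hotel": f"hotel in {state['destination']}"}


def activity_agent(state: TripState) -> TripState:
    """Stub for a model that only plans activities."""
    return {**state, "activities": f"tour of {state['destination']}"}


def run_workflow(agents: List[Agent], state: TripState) -> TripState:
    """Pass the shared state through each single-task agent in order."""
    for agent in agents:
        state = agent(state)
    return state


if __name__ == "__main__":
    plan = run_workflow(
        [flight_agent, hotel_agent, activity_agent],
        {"destination": "Lisbon"},
    )
    print(plan)
```

Real orchestration frameworks add conversation, retries and tool use on top of this basic hand-off pattern.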
Further off is the ability of AI to be intelligent without needing to fetch new data, delivering its own thoughts, insights and predictions based on existing datasets.
All of these AI evolutions demand that humans be able to safeguard and govern the data that LLMs use. Otherwise, data leakage and hallucinations will remain endemic, and the potential of AI models will be constrained, particularly in the enterprise. It’s time to break the model out of its model-centric box and into a broader swath of governed data.
Forbes Technology Council is an invitation-only community for world-class CIOs, CTOs and technology executives.