Getting Beyond AI Hype: What to Consider in Training and Knowledge Bases
Generative AI models are only going to be as good as the quality of the data on which they are trained. In other words: garbage in, garbage out.
June 29, 2023
You’ve surely heard that generative AI is revolutionizing how we live and work. You’ve also surely heard that it’s making incredible, inexplicable errors that the AI community calls “hallucinations.” Surely, you might think, this is all hype because both these statements cannot be true. But they are, and the reason the errors are possible lies in the twin concepts of training and knowledge bases.
Generative AI is based on large language models (LLMs) that create a contextual picture of something, which might be an image, a programming language, or human conversation. To create that contextual picture, they need to be trained on actual material of the type they’re expected to generate. For example, if the model is going to generate legal citations, it needs to be trained on legal texts. Early systems used an assisted or annotated method that involved human participation and review, but today’s systems train from a mass of appropriate material, like the Internet.
Training is incredibly complex, often requiring enormous GPU resources and a lot of time. It involves creating labels for information types (parts of speech, for example) and models describing how that information could be given “context,” meaning structured into sentences (for text), music, images, or computer languages. This is what lets an LLM process a query or conversation, or produce an image or music. A new version of a language model (GPT-4 or GPT-5, for example) might well require more time for training the new software than for making the changes to the software itself.
The widely used “public” AI models are trained by the company that offers them. Often the Internet or a broad subset of it is used as the training data, but specialized models for image generation, music, financial analysis, and software development usually train on a narrower subset of public data. Training consists of running the information through what’s usually a three-step process: analysis of input, then generation, then discrimination. The first step is an analysis of the query or input to extract data seeds that feed the second, or generation, phase. The generation phase suggests a series of contexts and data elements, which in turn are fed to the third, or constraint/discrimination, phase. Training has to optimize all of these steps, based on exemplars drawn from the source material and sometimes augmented by human review and intervention.
Knowledge, the other data piece of the generative AI puzzle, is the base of reference information generative AI works on. We generate petabytes of data almost hourly, and just as training has to label information, so does knowledge. The number 1,728 means something mathematically (it’s 12 to the third power), in describing volume (the number of cubic inches in a cubic foot), and perhaps in your business (the number of sales your company made last week). It would be nice if your generative AI response knew, if that number appeared, what it represented, wouldn’t it? That means generative AI has to work with more than just numbers and other data; it has to work with what they mean.
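As a rough illustration of the labeling point above, here is a minimal Python sketch in which the same number carries different labels. The `LabeledValue` class, its field names, and the sample records are all hypothetical, invented for this example, and do not describe how any particular generative AI product stores knowledge.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledValue:
    """A value plus a label saying what it represents (hypothetical type)."""
    value: int
    meaning: str
    unit: str


# Three readings of the same number, taken from the examples in the text.
# The weekly-sales entry is purely illustrative.
knowledge = [
    LabeledValue(12 ** 3, "twelve cubed", "pure number"),
    LabeledValue(12 * 12 * 12, "cubic inches in a cubic foot", "in^3"),
    LabeledValue(1728, "sales made last week", "orders"),
]


def interpretations(number: int, facts: list[LabeledValue]) -> list[str]:
    """Return every labeled meaning recorded for a bare number."""
    return [f"{f.meaning} ({f.unit})" for f in facts if f.value == number]


if __name__ == "__main__":
    for reading in interpretations(1728, knowledge):
        print(reading)
```

Strip the labels and all three entries collapse into the same bare 1,728, which is exactly the ambiguity a knowledge base has to resolve.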
The challenge in knowledge is the process of classifying the information. Structured data, like a company database, is pretty easy to classify because fields are usually labeled and the meaning can be assigned to all data by linking it to the fields themselves. Unstructured data, like free text, images, music, and video, has to be classified by context, which links it back to the training process, where context analysis has to be based. This linkage often means that the training base and the knowledge base are the same, in which case the model is working on data using insights that the analysis of that same data generated. Circular? Maybe, and that’s one reason why training processes are so critical in preventing inappropriate results.
Inappropriate results are the biggest problem users have with generative AI. Of the more than 100 users who offered me comments on their use of the technology, all said that they’d experienced hallucinations, and according to their estimates, between a fifth and a third of AI results contained at least one. AI errors or hallucinations are one major cause of inappropriate results, but training and knowledge also play a major role. Combine the training base and the knowledge base and your knowledge is dated, because training is time-consuming. Separate them and you have to ingest a lot of new material to address queries.
These issues have divided generative AI usage into what we could call “public” and “private” modes. In the public mode, the training base and knowledge base are integrated with the model and often unified. This mode is very useful for things like generating ad copy or writing articles, and this is the way generative AI tools are most often used today. For that set of applications, the Internet usually serves both as the information source for training and the knowledge source for response generation. When you use an online chatbot based on generative AI, it’s based on a public model.
Public models can also be “specialized” for a mission by training them on specific data types. For example, there are models to facilitate software development and review, and models that can create artwork and pictures or compose music better than the general public models. Because these use a limited training/knowledge base, they also generally have a lower rate of hallucinations, perhaps half that of broad Internet-trained models.
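To put rough numbers on that comparison, the sketch below halves the range users reported for broad models (between a fifth and a third of results). It is back-of-the-envelope arithmetic on informal estimates, with “perhaps half” taken literally as an assumption, and the variable names are just for illustration.

```python
from fractions import Fraction

# Users' estimated share of results containing at least one hallucination
# for broad Internet-trained models (the range quoted earlier in the article).
broad_low, broad_high = Fraction(1, 5), Fraction(1, 3)

# Assumption: "perhaps half" is treated as exactly one half.
specialized_low = broad_low / 2
specialized_high = broad_high / 2

print(f"broad:       {float(broad_low):.0%} to {float(broad_high):.0%}")
print(f"specialized: {float(specialized_low):.0%} to {float(specialized_high):.0%}")
```

On those assumptions, a specialized model would produce a hallucination in roughly one result in ten to one result in six, still far from negligible.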
The private mode operates on company data, but it could be trained on public data too. AI models for company financial analysis are already available, for example, and they’re trained on masses of public data but operate on a company’s own business records. This approach is much easier to adopt than one that requires a company to use generative AI software and train it on their own information, something that takes special skills and a lot of time and compute power. Those costs could be prohibitive.
Despite all the options and progress, generative AI still has an unacceptable error rate for most users and applications, and a lot of the work being done to improve generative AI seems focused on somehow bounding results rather than actually making better choices. Training improvements would seem to offer the best path to bettering the choices. The question is whether improvements in training can be done without impacting the use of generative AI on private data or introducing human judgment. It may well be that until those kinds of questions can be answered, we’ll have positive and negative hype to contend with, and that’s going to create user uncertainty.