Who Should Make a Voter AI Chatbot? (Part 4)

After an extended holiday break since my last post on the question “Who should make a Chatbot for voters?”, I’m back for the 4th installment in this series. This time, I’m pivoting from the Who question to the How question; and I have definitely pivoted from “Chatbot” to “domain-specific natural language agent” (NLA).

Why the pivot?

One reason is that there are legitimate design issues to highlight, so that no one misguidedly believes that any kind of domain-specific NLA would be a cakewalk. Another reason is that some analysis of the How should also inform the original question of Who; that is, who has the skills to tackle the issues?

Next time, I’ll focus on the requirements for a “safe” elections NLA: what it would look like, how it would work, and — often overlooked — who would manage it (and how). So, so many considerations, eh? But it’s better to deal with the elephant in the room first, turn to requirements next time, and then finish by going back to the Formula-1 analogy to talk about the Pit Crew.

The Proverbial Elephant in the Room

Before anyone can make a safe domain-specific NLA, there is a conundrum to be addressed:

  • Current LLMs are fundamentally unsafe for a variety of reasons: they are prone to inaccuracy and “hallucination,” and vulnerable to manipulation, poisoning, and a growing list of other hacks and attacks.

  • Because of their fundamental lack of safety, current LLMs are not suitable as the base model for any NLA that has a low tolerance for inaccuracy, hallucination, etc.

  • An elections-specific NLA would need to be very low-tolerance, as would many other domain-specific NLAs.

  • Yet, any low-tolerance domain-specific NLA requires a base model.

  • Constructing a new base model is wildly expensive (in the millions of dollars) — an undertaking available only to a handful of companies.

  • So, then, what base model should be used instead of current LLMs?

An obvious-sounding answer is a “Small Language Model,” which is rather a contradiction in terms. Any useful base model needs to be quite large; that is, trained on enough human-created text to build the base of data needed to consume and generate comprehensible natural language. [1] Perhaps it’s better to observe that “Small Language Model” is a misnomer, because any base model needs to be sufficiently large — but not gargantuan. I like to characterize current “Large Language Models” as actually “GIGA-LMs” — GIgantic GArgantuan language models. 🤓

Baby Elephants?

By contrast, I characterize “Small Language Models” as actually “RISC-LMs” — RIght Sized Complex Language Models. They’re plenty large, but only about as large (in terms of training data, among other factors) as they need to be in order to “talk human-like,” albeit with limitations compared to GIGA-LMs (e.g., they provide responses with just the facts, but not expressed as a sonnet in the manner of Petrarch 🤣).

So I will be a curmudgeon and use “SLLM” for a smaller (but still quite large) language model, and refer to it that way from here on. However, questions remain:

  • Are SLLMs real? Yes.

  • Have they been proven effective in terms of basic language capability? Yes.

  • Have they been proven safe? NO.

There are open SLLMs, like Llama’s small version. There are proprietary SLLMs, like Google’s Gemini Nano. There are extremely opaque custom SLLMs too, with some pretty big claims behind them: the company c3.ai, for example, claims it can build a custom base model and AI system for you, and furthermore that it will be completely accurate and won’t hallucinate — at a cost-to-use of about a quarter million dollars a year. 🙄

But is there evidence any of these is devoid of the ills of GIGA-LMs? NO.

Just using an SLLM by itself doesn’t get you to a safe NLA.

A Safer Start?

Rather, I expect that one or more SLLMs could be candidates for a better starting point for a low-tolerance NLA. However, the use of some current — and likely some new — safety mechanisms would still be required. I’m going to dive down a bit, but hope you’ll indulge me in the next four paragraphs.

It’s been amply demonstrated that GIGA-LMs are unsafe at any speed (i.e., the “designed-in dangers” of LLMs), such that the use of safety techniques ends up creating band-aids that can be bypassed. Yet, there remains hope that some safety techniques can be effective with SLLMs. That’s especially promising for domain-specific NLAs where limited “just the facts” responses are fine, and fancy entertaining discourse is not only unnecessary, but can actually detract from trustworthiness.

NOTE: That said, in our research work here we understand the power and potential of conversational AI to catalyze a trusted relationship between the user and the machine. That’s important to establish an NLA as trustworthy; the key is avoiding frolic and detour in the prompt-response exchange. Entertaining discourse is not the objective; a polite, engaging, and personalized response is.

Various other experiences with LLMs are relevant too. It’s also clear that adding new training data to an LLM might make it conversant on topics that it wasn’t before, but it can also make the model larger and less safe. In contrast, a domain-specific NLA needs to be able to derive responses from a corpus of authoritative information, without that information being part of the model. Furthermore, extended training and testing can be expensive, and therefore isn’t scalable for the many organizations (other than the titans) that might build NLAs.
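
To make that architecture concrete, here is a minimal, purely illustrative sketch: the authoritative corpus lives outside the model, the NLA retrieves relevant passages at question time, and the model is only asked to phrase an answer supported by that retrieved text (or to decline). Everything here, including the made-up passages and the hypothetical generate callable standing in for an SLLM, is my own placeholder rather than any particular product’s API.

    # Minimal sketch of a retrieval-grounded NLA response loop (illustrative only).
    # The authoritative corpus stays OUTSIDE the model; the model never "learns" it.

    from dataclasses import dataclass

    @dataclass
    class Passage:
        source: str  # e.g., a page maintained by an election office (hypothetical)
        text: str

    # A fixed, authoritative corpus; in practice, official election information,
    # reviewed and versioned by election officials. (The passages below are made up.)
    CORPUS = [
        Passage("registration-deadlines", "Voter registration closes 15 days before election day."),
        Passage("polling-hours", "Polling places are open from 7:00 AM to 8:00 PM on election day."),
    ]

    def retrieve_passages(question, corpus, k=2):
        """Naive keyword-overlap retrieval; a real system would use a proper index."""
        q_terms = set(question.lower().split())
        scored = sorted(corpus, key=lambda p: -len(q_terms & set(p.text.lower().split())))
        return [p for p in scored[:k] if q_terms & set(p.text.lower().split())]

    def answer(question, generate):
        """`generate` is a stand-in for an SLLM call: (instructions, context) -> text."""
        passages = retrieve_passages(question, CORPUS)
        if not passages:
            # Low-tolerance behavior: decline rather than guess.
            return "I can't answer that from the official information I have."
        context = "\n".join(f"[{p.source}] {p.text}" for p in passages)
        return generate(
            "Answer ONLY from the passages below; if they don't answer the question, say so.",
            context + "\nQuestion: " + question,
        )

The division of labor is the point: the corpus carries the facts, and the model only carries the language.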

Another related failure mode comes from a technique in which an LLM’s usage actually extends the LLM itself, a bit like (though it’s a huge misnomer) learning from its own experience. When the future behavior of an LLM can be manipulated by its present usage, the LLM is vulnerable to becoming even less safe over time. In contrast, a low-tolerance, domain-specific NLA must respond deterministically (though not identically), with responses based on a fixed corpus of authoritative information that does not change on the fly.

NOTE: However, periodic refreshes of the corpus may be required, meaning that a rebuild and redeploy of a refreshed NLA needs to be feasible.
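
One way to picture that rebuild-and-redeploy discipline (again, a hedged sketch rather than a prescription): treat the reviewed corpus as an immutable, content-addressed release, so that a refresh produces a new version the NLA is redeployed against, and nothing a user does at runtime can alter it. The file layout and the build_corpus_release function below are assumptions for illustration.

    # Sketch: the authoritative corpus as a versioned, immutable artifact.
    # A content refresh builds a NEW release and redeploys the NLA against it;
    # runtime usage never feeds back into the corpus (or the model).

    import datetime
    import hashlib
    import json
    import pathlib

    def build_corpus_release(source_dir, out_dir):
        """Snapshot reviewed source documents into an immutable, hashed release."""
        docs = {}
        for path in sorted(pathlib.Path(source_dir).glob("*.txt")):
            docs[path.name] = path.read_text(encoding="utf-8")

        payload = json.dumps(docs, sort_keys=True).encode("utf-8")
        version = hashlib.sha256(payload).hexdigest()[:12]  # content-addressed version id

        out = pathlib.Path(out_dir)
        out.mkdir(parents=True, exist_ok=True)
        release = out / f"corpus-{version}.json"
        release.write_text(json.dumps({
            "version": version,
            "built_at": datetime.datetime.now(datetime.timezone.utc).isoformat(),
            "documents": docs,
        }, indent=2), encoding="utf-8")
        return version  # the deploy step points the NLA at corpus-<version>.json

    # e.g., new_version = build_corpus_release("reviewed_docs/", "releases/")

Content-addressing the release means any change to the reviewed documents yields a new version identifier, which keeps “which corpus was this NLA answering from?” an auditable question.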

I’m sure this list of how-to-do-it-differently items could be extended, but the point is that there is much to be learned, and much of the learning comes from doing. So, there’s no time like the present to start building such NLAs with SLLMs, determining the limits of their safety, and working out how to respond to those limits.

To be continued

______________________
[1] In the TL;DR department, some will challenge me by asking “Why is that the case?” Fair enough. Here is my take…

  • The statement is empirical. The smallest functional LLM about which there is public info (AFAICT) is Llama’s small version, and it’s still huge. The same is true of Google’s Gemini Nano. Empirically, you don't get viable human-like discourse with only a small amount of training data. I suppose somebody somewhere is trying to find some point on the spectrum where the training data is small, hence the runtime engine is small, and the discourse capability is barely good enough. Yet, it would still be very expensive to do, and I doubt the handful of organizations who could do so would bother.

  • Now, to give the reader some context on size (for whatever it matters): there are quite a few models smaller than 15 million parameters, but those are usually for single-purpose systems (e.g., robotics). For instance, someone here tells me that Microsoft PACT may be the smallest LLM at about 30M parameters. HuggingFace hosts something like 130,000 models of different sizes.

  • And for those less acquainted with parameters: in machine learning, a model learns from data by adjusting its parameters. Parameters (params) are the aspects of the model that are learned from the training data, and they define the transformation between the model’s input and output. In Generative AI models, these parameters are often weights in neural networks that the model adjusts during training. They are critical in helping the model make predictions or decisions based on input data. (There’s a tiny worked sketch of this at the end of this footnote.)

  • By comparison, the smaller LLMs most often referenced are Flan (11 billion params), Granite (13B), and Llama2 (70B), but it really doesn’t matter. Whether it is millions or billions, it is still very large in terms of the training data itself (not just the number of params). The idea that smaller is less hallucinatory is just speculation at this point. And it’s still well accepted that even if you follow the fool’s-gold path of taking an existing model and doing supplementary training on your domain-specific stuff, once the world changes and some part of your stuff becomes false, you cannot take it out of the model. You have to go back to the base model, refactor with a new set of domain training data, and start from scratch. 🙄
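
Since “parameters” can read abstractly, here is the tiniest possible worked sketch of learning parameters from data: a one-input linear model with exactly two parameters, a weight and a bias, nudged to fit a few made-up points. Generative AI models do conceptually the same thing with billions of such numbers.

    # The smallest possible "model": y = w*x + b, with exactly two parameters (w, b).
    # Training adjusts the parameters to fit the data; LLMs apply the same idea
    # to billions of parameters.

    data = [(1.0, 3.1), (2.0, 4.9), (3.0, 7.2)]  # made-up (input, target) pairs, roughly y = 2x + 1
    w, b = 0.0, 0.0                              # the model's parameters, before training
    lr = 0.01                                    # learning rate

    for _ in range(2000):
        for x, y in data:
            pred = w * x + b      # the model's output given its current parameters
            err = pred - y        # how wrong that prediction is
            w -= lr * err * x     # nudge each parameter to reduce the error
            b -= lr * err

    print(f"learned parameters: w={w:.2f}, b={b:.2f}")  # converges to roughly w = 2.05, b = 0.97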