Foundation Models and Generative AI

Rahul Agarwal
10 min read · Jun 15, 2023
Generated image by author using Stable Diffusion

In the last few weeks, I have built some simple applications: a chained chatbot over custom data using OpenAI, and some locally deployed models for chat and image generation. Taking a step back, this week is my attempt to understand the bigger picture behind these recent AI tools. Here are some of my thoughts on the broader implications and applications of this emerging trend.

Foundation Models

ChatGPT suddenly showed up and became wildly popular. But digging into it shows that it has been in the making for many years, and there are certain novel aspects of such AI models that we must understand to fully take advantage of their capabilities. ChatGPT specifically uses a transformer architecture.

The Transformer is a popular neural network architecture for natural language processing (NLP), introduced in 2017. It uses self-attention to capture dependencies between words in a sequence, making it effective for tasks like machine translation, summarization, and classification. The model consists of an encoder and a decoder: the encoder transforms the input into a contextual representation, and the decoder generates the output. The Transformer's strength lies in its ability to capture long-range dependencies and contextual information, leading to significant advancements in NLP tasks like translation and text generation.
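To make self-attention concrete, here is a minimal sketch of scaled dot-product attention in plain NumPy. The dimensions and weights are made-up toy values; real Transformers add multiple heads, positional encodings and feed-forward layers on top of this core.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # X: (seq_len, d_model); the W matrices project tokens into query/key/value spaces
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # how strongly each token attends to every other
    return softmax(scores) @ V               # attention-weighted mix of the values

rng = np.random.default_rng(0)
d_model, d_k, seq_len = 16, 8, 5             # tiny toy sizes
X = rng.normal(size=(seq_len, d_model))      # five "token" embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)   # (5, 8)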

But that is not the only type, and as a generalized name, Stanford HAI coined the term Foundation Models to describe such general-purpose models. Foundation models are no longer narrow and task-oriented as in the past; they are general purpose and capable of performing multiple tasks. This concept is known as "homogenization." Additionally, since the model is trained largely through self-supervision, its capabilities have to be discovered and are unknown even to the model creators. This concept is known as "emergence." These models are not limited to text processing and also apply to images, audio, video and any other type of data.

The models are trained on massive amounts of data. The compute and data-gathering costs are so high that very few organizations can actually afford them, leading to control by a few. I will talk about dataset biases, copyright and licensing issues towards the end. However, the threshold of access, both in terms of cost and ease of use, has been lowered significantly compared to the prior generation of tools. Compelling apps can be demonstrated within a matter of days!

For testing purposes, it is possible to deploy locally. For commercial deployments the feasible option may be OpenAI and their APIs (and also Azure OpenAI). AWS has announced Bedrock but it is not GA yet, and GCP has a waitlist as well. You can deploy your own local models on appropriately powered hosts too, but I don't have any production experience with that; I would be curious to learn what types of metrics are useful in operating such foundation models.
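For a taste of local deployment, here is a minimal sketch using the Hugging Face transformers library. The model choice (gpt2, small enough for a laptop CPU) is just an example for testing; production-grade quality needs a much larger model and appropriately powered hosts.

from transformers import pipeline

# Download and run a small local text-generation model; no API keys involved.
generator = pipeline("text-generation", model="gpt2")
print(generator("Foundation models are", max_new_tokens=30)[0]["generated_text"])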

Large Language Models (LLMs)

LLMs are the language-focused foundation models. Based on my reading of articles from Andrew Ng and Andrej Karpathy, the intuition I would suggest is to think of these LLMs as "reasoning engines." Like us, they have a small working memory. This memory is measured in tokens and is known as the context. In a particular interaction, only a limited amount of information can be processed and reasoned about. However, just as you might consult a reference manual, your notes, a book or other external sources of information, these models can do the same. The LLM helps to understand the request and produce an action we can apply. Right now the models limit answers to their training-data cutoff, but in the future I would expect the knowledge of the world to be externalized. This process of calling external tools, or even the same model again, before forming the final response is known as chaining. Creating chains is very important and there are some examples later.
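To get a feel for how context is measured, here is a small sketch that counts tokens with OpenAI's tiktoken library; the context window limit is expressed in exactly these tokens, not in characters or words.

import tiktoken

enc = tiktoken.encoding_for_model("gpt-3.5-turbo")
text = "LLMs reason over a limited context window."
tokens = enc.encode(text)
print(len(tokens), tokens[:5])  # token count, and the first few token ids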

Some categories and examples of models can be summarized as follows:

  1. Base Model: Trained on anything from the internet with very large amounts of data. Examples: GPT, LLaMA, PaLM, Falcon
  2. Supervised Fine Tuning (SFT) Model: On top of the base model, add a lot of human-curated prompts and responses. Examples: Vicuna, Falcon
  3. Reinforcement Learning from Human Feedback (RLHF) Model: On top of the SFT model, add lots of human-written, well-explained prompts, plus human rankings/preferences of prompts and responses. Examples: ChatGPT, Claude, Bard

Image, Audio, Video and other models

Image generation models like Stable Diffusion, Midjourney, DALL-E and even Adobe Firefly integrations into Photoshop are now common and accepted. Firefly is particularly simple to use.

Generated with Adobe Firefly (A heart image that is created using random colorful images of cars, animals, buildings, computers from around the world)

For DALL-E, call the OpenAI API with your API key in the Authorization header:

POST https://api.openai.com/v1/images/generations

{
  "prompt": "A heart image that is created using random colorful images of cars, animals, buildings, computers from around the world",
  "n": 1,
  "size": "512x512"
}
Generated with OpenAI DALL-E
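Equivalently, a sketch of the same request through the openai Python package (the pre-1.0 API current as of this writing), assuming your OPENAI_API_KEY environment variable is set:

import openai

response = openai.Image.create(
    prompt="A heart image that is created using random colorful images of "
           "cars, animals, buildings, computers from around the world",
    n=1,               # number of images to generate
    size="512x512",    # supported sizes: 256x256, 512x512, 1024x1024
)
print(response["data"][0]["url"])  # URL of the generated image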

There are also code generation models such as Amazon CodeWhisperer, GitHub Copilot and OpenAI Codex. But there are many more, and Hugging Face is the place to look.

Prompt Engineering

Ask no questions, and you’ll be told no lies

Now that we have so many models, how do we use them? This is where the fascinating new concept of prompt engineering comes in. The prompt is how a request is provided to the model. Calling it "engineering" feels like taking it a bit too far, but essentially: what are the best practices for getting high quality responses from a model and discovering its emergent capabilities?

For some context, let me highlight two important things I have learnt so far. First, I am used to complete control. With the imperative approach, clear instructions are provided that always result in a deterministic response. I have debuggers and can walk precisely through the code path taken in a scenario. Yes, there are hard-to-reproduce cases and data/environment-specific issues, but as a developer I can know the expected output for a given input. This fundamental premise gets challenged when dealing with foundation models. Explaining an answer is not simple, and there are even controls that deliberately produce random responses (known as temperature).

Temperature refers to a parameter controlling response randomness. Higher values increase diversity, while lower values produce more focused answers. The term is borrowed from statistical physics, representing energy levels in a system. It doesn’t impact model knowledge, only the output’s variation.

There is another parameter called top_p. I have not used it so far, but it also provides variability, and it's suggested to adjust only one of the two. Models may hallucinate as well, which is a nice way of saying that the response was pulled out of thin air.

Hallucination refers to generating information that is fictional. It can range from minor inaccuracies to significant and complete fabrications.
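Coming back to the randomness controls, here is a sketch of where temperature (or top_p) plugs into a chat completion call, again with the pre-1.0 openai package:

import openai

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Name a plant that grows well near grapes."}],
    temperature=0.2,  # low randomness; raise it (or set top_p instead, not both) for variety
)
print(response["choices"][0]["message"]["content"])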

The second difference is that instead of the precise syntax used when writing code, the prompt is in natural language and subject to both human and machine interpretation. For a simple example, x==1 in a conditional is easy to interpret, but in natural language it is not so simple:

The value of x is 1

x takes on the value of 1

When evaluating the value of x, it is determined that x is equal to 1

If we compare the value of x with the number 1, we find that x is indeed equal to 1

After performing the necessary calculations, it becomes evident that x holds the value of 1

Based on examples I have come across, this makes prompts very verbose and hard to follow. Today I am able to understand code written by colleagues around the world; what will happen to prompt writing with cultural and global influences thrown in? What if your code comments/javadocs became the code that was executed? How do you debug and fix issues? I do not have answers yet.

Thought-Action-Observation

One recommended way to get better responses is something called Chain of Thought reasoning. Intuitively, the approach is similar to how a human would break a question down into smaller parts, reason about one part, and based on the outcome decide what to do next. Given a query to answer, this creates the following journey:

  • Thought — formulate the task that needs to be done.
  • Action — perform the task.
  • Observation — evaluate the outcome. This may now feed into the next thought.
Thought-Action-Observation loop in the ReAct paradigm

Example Prompt 1 — a typical example of chaining — What other plants can grow well where grapes thrive?

Thought1 — what conditions do grapes thrive in?
Action1 — find the answer
Observation1 — I have the soil/climate/location data. This is not the final answer.

Thought2 — what else grows well in the given soil/climate/location?
Action2 — find the answer
Observation2 — I am done, formulate a response

Example Prompt 2 — Math riddles are also popular examples — which weighs more, 1 lb or two 8 oz bars?

Thought1 — weight units are inconsistent so let us convert them
Action1 — convert to common units
Observation1 — I have 16oz and 2*8oz

Thought2 — which one weighs more?
Action2 — compare
Observation2 — formulate response
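Frameworks like LangChain automate this loop. As a sketch using LangChain's 2023-era agent API (a SerpAPI key is assumed for the search tool), the first example above could be run like this:

from langchain.agents import AgentType, initialize_agent, load_tools
from langchain.llms import OpenAI

llm = OpenAI(temperature=0)
tools = load_tools(["serpapi", "llm-math"], llm=llm)  # search + calculator actions
agent = initialize_agent(
    tools, llm,
    agent=AgentType.ZERO_SHOT_REACT_DESCRIPTION,
    verbose=True,  # prints each Thought/Action/Observation step as it happens
)
agent.run("What other plants can grow well where grapes thrive?")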

You will notice this relies on "working memory," or context. Since context is limited, for longer and more complex queries you may not get the correct response. So as part of the prompt we also need to provide some instructions and details about what we are looking for. The prompt not only provides the query but also guidance to the model. To that end, here are some tips:

  • Suggest a persona — you are an assistant that replies politely
  • Ask for a correct solution — to minimize false answers
  • Ask to list the steps — to give the model time to actually create the steps and explicitly list them
  • Ask to use external tools — for tasks the model is not good at, explicitly ask it to use one, for example a calculator. This can be plugins in ChatGPT or other ways you may be chaining (see my example on using your custom data with LangChain and LlamaIndex)
  • Ask for specific output types — this can be JSON or another easy-to-parse response for your chain
  • Provide examples — this is also called one-shot or few-shot prompting, where a handful of worked examples go directly into the prompt: specific examples of translation, say, or how certain items should be categorized (see the sketch below)
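Here is a sketch that combines several of these tips: a persona, few-shot examples, and a request for JSON output (the product categories are invented for illustration):

import openai

messages = [
    {"role": "system", "content": "You are a polite assistant that classifies "
                                  "products and always replies with JSON only."},
    # Few-shot examples: show the model the exact output shape we want.
    {"role": "user", "content": "Classify: 'stainless steel whisk'"},
    {"role": "assistant", "content": '{"category": "kitchen"}'},
    {"role": "user", "content": "Classify: 'cordless drill'"},
    {"role": "assistant", "content": '{"category": "tools"}'},
    # The real query.
    {"role": "user", "content": "Classify: 'garden hose'"},
]
response = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages)
print(response["choices"][0]["message"]["content"])  # e.g. {"category": "garden"}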

Chatbot Example

For the obvious chatbot example, it actually needs to be a chain as depicted here.

Chatbot Chains
  • Input moderation is necessary to ensure only appropriate content is processed, including preventing prompt injection (which, as the name suggests, is similar to SQL injection).
  • Providing a high quality response may require multiple models as well as external tools, so all those steps must be orchestrated (sketched below).
  • Output moderation is necessary to only return appropriate content.
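As a minimal sketch of this chain with the pre-1.0 openai package: moderate the input, generate a response, then moderate the output. A real chain would add retrieval, tool calls and dedicated prompt-injection checks on top of this skeleton.

import openai

def moderated_chat(user_input: str) -> str:
    # Input moderation: refuse flagged content before it ever reaches the model.
    if openai.Moderation.create(input=user_input)["results"][0]["flagged"]:
        return "Sorry, I can't help with that."
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": user_input}],
    )
    answer = response["choices"][0]["message"]["content"]
    # Output moderation: check our own generated text before returning it.
    if openai.Moderation.create(input=answer)["results"][0]["flagged"]:
        return "Sorry, I can't share that response."
    return answer

print(moderated_chat("Suggest companion plants for a vineyard."))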

Thoughts and Unanswered Questions

Application testing

How do you test your fancy new chatbot across all levels of the test pyramid? How do you detect a regression? Testing frameworks are still on the drawing board, and model creators only benchmark their models, which is not necessarily helpful for app developers. Also, since the responses are not deterministic, the typical assert mechanism is not effective. One solution I have come across is to use the model itself to validate and categorize the answers against some rubric and grade the response.
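As a sketch of that model-graded approach (the rubric and the PASS/FAIL scale are invented for illustration):

import openai

def grade_response(question: str, answer: str) -> str:
    rubric = ("Grade the ANSWER to the QUESTION as PASS or FAIL. PASS only if it "
              "is factually plausible, on-topic, and polite. Reply with one word.")
    result = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "system", "content": rubric},
                  {"role": "user", "content": f"QUESTION: {question}\nANSWER: {answer}"}],
        temperature=0,  # grading should be as repeatable as possible
    )
    return result["choices"][0]["message"]["content"].strip()

grade = grade_response("What grows well near grapes?", "Rosemary and lavender.")
print(grade)  # a test suite would assert grade == "PASS" here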

Biases

Dataset biases are present and a well-known concern. These get amplified and reproduced in the generated content. As with the real world, acknowledging bias and continually striving to improve is probably the best we can do.

Licensing and Copyright

Is it okay to train models on copyrighted content, or on content under various licensing rules? Who owns the generated content and what license applies to it? There is no clear precedent at the moment, and depending on risk profile, enterprises are adopting these new tools to varying degrees. Checking your company's internal guidance would be best.

Private Use

Privacy, security and compliance are non-negotiable aspects of enterprise deployments. Azure provides an excellent guide to all aspects and their supported capabilities. Based on this, the top things to ensure are kept private to you:

  • Your training data
  • Any custom/SFT models you create
  • Prompts and generated content
  • Logs and other observability data

Further, if any moderation or review is performed, it must be done by approved persons only.

Conclusion

There is unfortunately a lot of fear-mongering around this AI hype cycle. Based on what I have learnt, there are new tools we must all learn and adopt, and no, they will not end the world. As with anything else, this new technology can certainly be misused.

Finally, some wishful thinking: depending on the question asked, a ChatGPT response often includes things I already know, which is just unnecessary information for me. Somehow my existing knowledge needs to be made available so that responses inform me only about what I am actually looking for!

Some quoted parts are verbatim from ChatGPT. There is also some other help and advice in other places if you can spot it!

If these topics interest you, reach out to me; I will appreciate any feedback. If you would like to work on such problems, you will generally find open roles as well! Please refer to LinkedIn.
