OpenAI Response Generation Trained On A Large Corpus of Data

Any form of context in a prompt serves as an invaluable reference for generating an appropriate and accurate response. But how do you accommodate a larger body of contextual data within an OpenAI generative query?

Cobus Greyling
5 min read · Jan 11, 2023


I’m currently the Chief Evangelist @ HumanFirst. I explore and write about all things at the intersection of AI and language; ranging from LLMs, Chatbots, Voicebots, Development Frameworks, Data-Centric latent spaces and more.

Introduction

Prompt Engineering for generative LLMs has emerged as a “thing”, with users learning that an LLM cannot simply be instructed directly. Rather, through a process of simulation and influence on the part of the user, the prompt is engineered so that the LLM yields the right answer.

Contextually engineered prompts typically consist of three sections: instruction, context and question, as in the sketch below.
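As a minimal illustration of this structure (the wording and the Olympics example are my own, not taken verbatim from OpenAI’s documentation), such a prompt could be assembled like this:

# A contextually engineered prompt with its three sections:
# an instruction, a block of reference context, and the user's question.
instruction = (
    "Answer the question as truthfully as possible using the provided context. "
    "If the answer is not contained in the context, say \"I don't know\"."
)

context = (
    "The 2020 Summer Olympics men's high jump was won jointly by "
    "Mutaz Essa Barshim of Qatar and Gianmarco Tamberi of Italy."
)

question = "Who won the 2020 Summer Olympics men's high jump?"

prompt = f"{instruction}\n\nContext:\n{context}\n\nQ: {question}\nA:"
print(prompt)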

In a previous article I wrote about the importance of context as a reference for generative LLMs, and how LLM hallucination can be mitigated by it.

However, one of the challenges which emerged from the previous article was how a larger body of text (corpus) can be used as a contextual reference.

Hence, in this article I step through the process of getting relevant Wikipedia data on a particular subject, transforming the data, creating embeddings and, lastly, referencing the contextual data in a query.

🌟 Follow me on LinkedIn for Updates on Conversational AI 🙂

Embeddings

The challenge is that as implementations of OpenAI grow in complexity, the user experience jumps very quickly from a no-code environment to a pro-code environment.

In order to search the larger corpus of Olympics-related information we have been using, text embeddings need to be created.

OpenAI’s text embeddings measure the semantic similarity between strings of text.

Embeddings are relevant whenever any kind of semantic similarity needs to be detected; for instance in search, classification and clustering.

In the words of OpenAI:

An embedding is a vector (list) of floating point numbers. The distance between two vectors measures their relatedness. Small distances suggest high relatedness and large distances suggest low relatedness.

The process of creating text embeddings includes collecting the data, transforming it, saving it, creating the embeddings with OpenAI, and then answering questions based on search results from the embeddings.
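To make the idea of vector distance concrete, here is a minimal sketch using the (pre-1.0) openai Python library and the text-embedding-ada-002 model; the helper names and example strings are my own assumptions, not the notebook’s code:

import numpy as np
import openai

openai.api_key = "YOUR_API_KEY"  # assumption: supply the key from your own config

def get_embedding(text: str, model: str = "text-embedding-ada-002") -> list[float]:
    # Request a single embedding vector for the given text.
    result = openai.Embedding.create(model=model, input=text)
    return result["data"][0]["embedding"]

def cosine_similarity(a, b) -> float:
    # Higher values mean the two texts are more closely related.
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v1 = get_embedding("Who won the men's high jump at the 2020 Olympics?")
v2 = get_embedding("Results of the athletics events at the Tokyo Olympics")
v3 = get_embedding("How do I bake sourdough bread?")

print(cosine_similarity(v1, v2))  # relatively high similarity
print(cosine_similarity(v1, v3))  # relatively low similarity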


Read more about embeddings here:

Below is a snippet of the relevant Wikipedia data file:

import pandas as pd

# Load the pre-sectioned Wikipedia data on the 2020 Summer Olympics.
df = pd.read_csv('https://cdn.openai.com/API/examples/data/olympics_sections_text.csv')
df = df.set_index(["title", "heading"])
print(f"{len(df)} rows in the data.")
df.sample(5)

And here you can see the file format after the embeddings have been created using the model text-embedding-ada-002.

# Path to the CSV holding one embedding vector per document section.
datafile_path = '/olympics_sections_document_embeddings.csv'

# Load the precomputed embeddings and inspect a few rows.
df = pd.read_csv(datafile_path)
df.sample(5)
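For completeness, here is a sketch of how such an embeddings file could be produced with text-embedding-ada-002. It assumes the dataset’s text column is named content; the simple row-by-row loop is my own simplification rather than the notebook’s exact code:

import pandas as pd
import openai

openai.api_key = "YOUR_API_KEY"  # assumption: supply the key from your own config

EMBEDDING_MODEL = "text-embedding-ada-002"

df = pd.read_csv('https://cdn.openai.com/API/examples/data/olympics_sections_text.csv')

def embed(text: str) -> list[float]:
    # One API call per section; a production script would batch and rate-limit.
    result = openai.Embedding.create(model=EMBEDDING_MODEL, input=text)
    return result["data"][0]["embedding"]

# Compute an embedding for the content of every section and save to disk.
df["embedding"] = df["content"].apply(embed)
df.to_csv("olympics_sections_document_embeddings.csv", index=False)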

The data manipulation and accompanying code make this an advanced, pro-code process. Read more about the data aspect and view notebook examples here.

The final result is shown below: a question is asked and answered with reference to the embedded context, as sketched in the snippet that follows.
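The sketch below condenses that question-answering step, reusing the get_embedding and cosine_similarity helpers and the in-memory df from the earlier sketches; the function name, top_n parameter and prompt wording are my own assumptions:

import openai

def answer_query(question: str, df, top_n: int = 3) -> str:
    # Rank document sections by similarity to the question.
    query_embedding = get_embedding(question)
    df = df.copy()
    df["similarity"] = df["embedding"].apply(
        lambda e: cosine_similarity(query_embedding, e)
    )
    # Use the most relevant sections as the context block of the prompt.
    context = "\n".join(
        df.sort_values("similarity", ascending=False).head(top_n)["content"]
    )

    prompt = (
        "Answer the question as truthfully as possible using the provided context, "
        "and if the answer is not contained within the text below, say \"I don't know.\"\n\n"
        f"Context:\n{context}\n\nQ: {question}\nA:"
    )
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=300,
        temperature=0,
    )
    return response["choices"][0]["text"].strip()

print(answer_query("Who won the 2020 Summer Olympics men's high jump?", df))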

You will find the complete notebook here:

In Conclusion

For semantic search, OpenAI has deprecated its document search approach and replaced it with the embeddings model.

There has been much talk about harnessing LLMs and producing more predictable and reliable results. I have always maintained that the only solution is fine-tuning LLMs and creating custom models for each specific implementation.

From an OpenAI perspective, there is no dashboard or NLU/NLG Design studio, and the data collection, transformation and ingestion require custom development.

However, to achieve this at scale, a no-code studio approach is required, where unstructured training data can be converted into NLU and NLG Design data with weak supervision.



https://www.linkedin.com/in/cobusgreyling
