OpenAI Response Generation Trained On A Large Corpus of Data

Any form of context in prompt engineering is an invaluable reference for generating an appropriate and accurate response. But how do you accommodate a larger body of contextual data within an OpenAI generative query?

Cobus Greyling
5 min read · Jan 11, 2023



Introduction

Prompt engineering for generative LLMs has emerged as a “thing”, with users learning that an LLM cannot be instructed directly. Rather, through a process of simulation and influence from the user, the LLM is engineered to yield the right answer, if you like.

Contextually engineered prompts typically consist of three sections: instruction, context and question.
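
As a minimal sketch of that structure (the wording and the sample fact below are my own illustration, not from the article), such a prompt could be assembled like this:

# Sketch of a contextually engineered prompt: instruction, context, question.
instruction = (
    "Answer the question as truthfully as possible using the provided context, "
    "and if the answer is not contained within the text below, say \"I don't know.\""
)
context = (
    "The men's high jump at the 2020 Summer Olympics was won jointly by "
    "Mutaz Essa Barshim and Gianmarco Tamberi."
)
question = "Who won the men's high jump at the 2020 Summer Olympics?"

prompt = f"{instruction}\n\nContext:\n{context}\n\nQ: {question}\nA:"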

In a previous article I wrote about the importance of context as a reference for generative LLMs, and how inaccurate LLM responses can be negated by it.

However, one of the challenges which emerged from the previous article was how a larger body of text (corpus) can be used as a contextual reference.

Hence, here I step through the process of getting relevant data from Wikipedia on a particular subject, transforming the data, creating embeddings, and lastly referencing the contextual data in a query.


Embeddings

The challenge is that as implementations of OpenAI grow in complexity, the user experience jumps very quickly from a no-code environment to a pro-code environment.

In order to search the larger corpus of Olympics-related information we have been using, text embeddings need to be created.

OpenAI’s text embeddings measure the semantic similarity between strings of text.

Embeddings are relevant whenever any kind of semantic similarity needs to be detected; for instance in search, classification, clustering, and so on.

In the words of OpenAI:

An embedding is a vector (list) of floating point numbers. The distance between two vectors measures their relatedness. Small distances suggest high relatedness and large distances suggest low relatedness.
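
As a sketch of how this relatedness can be measured in code, using the pre-v1 openai Python package this article is based on together with numpy (the helper names here are my own):

import numpy as np
import openai

openai.api_key = "YOUR_API_KEY"  # assumption: supply your own key

def get_embedding(text, model="text-embedding-ada-002"):
    # Return the embedding vector for a single string.
    result = openai.Embedding.create(model=model, input=text)
    return result["data"][0]["embedding"]

def vector_similarity(x, y):
    # OpenAI embeddings are normalised to length 1, so the dot product
    # doubles as cosine similarity: higher means more related.
    return np.dot(np.array(x), np.array(y))

print(vector_similarity(get_embedding("men's high jump winner"),
                        get_embedding("Barshim and Tamberi shared the gold medal")))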

The process of creating text embeddings includes data collection, data transformation, saving the data, creating the embeddings with OpenAI, and finally answering questions based on search results from the embeddings.


Read more about embeddings here:

Below is a snippet of the relevant code, which loads the pre-processed Wikipedia data:

import pandas as pd

# Load the pre-processed Olympics sections from the OpenAI example data.
df = pd.read_csv('https://cdn.openai.com/API/examples/data/olympics_sections_text.csv')
df = df.set_index(["title", "heading"])
print(f"{len(df)} rows in the data.")
df.sample(5)
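
To produce the embeddings file shown next, each (title, heading) section of the dataframe is embedded individually. A minimal sketch of that step, reusing the hypothetical get_embedding helper above and assuming the section text lives in the content column of the example data:

# Compute an embedding for every section of the Olympics corpus.
def compute_doc_embeddings(sections_df):
    return {
        idx: get_embedding(row.content)
        for idx, row in sections_df.iterrows()
    }

document_embeddings = compute_doc_embeddings(df)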

And here you can see the file format after the embeddings have been created using the model text-embedding-ada-002.

# Pre-computed embeddings for each section, one float per column.
datafile_path = '/olympics_sections_document_embeddings.csv'

df = pd.read_csv(datafile_path)
df.sample(5)
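
To use the stored embeddings for search, the vectors are reconstructed from the CSV and every section is ranked by its similarity to the query. A sketch, assuming the cookbook's file layout of a title and heading column followed by one column per embedding dimension:

# Rebuild {(title, heading): embedding} from the CSV, then rank sections
# by similarity to a question.
def load_embeddings(fname):
    emb_df = pd.read_csv(fname, header=0)
    max_dim = max(int(c) for c in emb_df.columns if c not in ("title", "heading"))
    return {
        (r.title, r.heading): [r[str(i)] for i in range(max_dim + 1)]
        for _, r in emb_df.iterrows()
    }

def order_by_similarity(query, contexts):
    query_embedding = get_embedding(query)
    return sorted(
        ((vector_similarity(query_embedding, emb), idx) for idx, emb in contexts.items()),
        reverse=True,
    )

document_embeddings = load_embeddings(datafile_path)
print(order_by_similarity("Who won the men's high jump?", document_embeddings)[:3])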

Preparing the data and code is an advanced process. Read more about the data aspect and view the notebook examples.

In the final result, a question is asked and answered using the most relevant sections as context.
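
A sketch of that last step, in the spirit of the cookbook notebook: the most relevant sections are stitched into the prompt as context, and a completion model answers the question. Here sections_df is a placeholder for the sections dataframe loaded earlier from olympics_sections_text.csv, and the prompt wording and model parameters are illustrative:

# Answer a question using only the most relevant sections as context.
def answer_with_context(question, sections_df, document_embeddings, n_sections=3):
    most_relevant = order_by_similarity(question, document_embeddings)[:n_sections]
    context = "\n* ".join(sections_df.loc[idx].content for _, idx in most_relevant)

    prompt = (
        "Answer the question as truthfully as possible using the provided context, "
        "and if the answer is not contained within the text below, say \"I don't know.\"\n\n"
        f"Context:\n* {context}\n\nQ: {question}\nA:"
    )
    response = openai.Completion.create(
        model="text-davinci-003",  # completion model available at the time of writing
        prompt=prompt,
        temperature=0,
        max_tokens=300,
    )
    return response["choices"][0]["text"].strip()

answer = answer_with_context(
    "Who won the men's high jump at the 2020 Summer Olympics?",
    sections_df,  # the dataframe loaded earlier from olympics_sections_text.csv
    document_embeddings,
)
print(answer)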

You will find the complete notebook here:

In Conclusion

For semantic search, OpenAI has deprecated their document search approach and replaced it with the embeddings model.

There has been much talk regarding harnessing LLMs and creating more predictable and reliable results. I have always maintained that the only solution is creating models tailored to each specific implementation.

From an OpenAI perspective, there is no dashboard or NLU/NLG design studio, and the data collection, transformation and ingestion require custom development.

However, to achieve this at scale, a no-code studio approach is required, where, with weak supervision, unstructured training data can be converted into NLU and NLG design data.


I explore and write about all things at the intersection of AI and language: Conversational AI, development frameworks, and more.

https://www.linkedin.com/in/cobusgreyling
