OpenAI Response Generation Trained On A Large Corpus of Data

Any form of context in prompt engineering is an invaluable reference for generating an appropriate and accurate response. But how do you accommodate a larger body of contextual data within an OpenAI generative query?

Cobus Greyling
5 min read · Jan 11, 2023



Introduction

Prompt engineering for generative LLMs has emerged as a “thing”, with users learning that an LLM cannot be instructed directly. Rather, through a process of simulation and influence from the user, the LLM is engineered to yield the right answer, if you like.

Contextually engineered prompts typically consist of three sections: instruction, context and question.
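
As a minimal sketch of that structure (the wording and the sample fact below are my own illustration, not from the article), such a prompt could be assembled like this:

# Sketch of a contextually engineered prompt: instruction, context, question.
instruction = (
    "Answer the question as truthfully as possible using the provided context, "
    "and if the answer is not contained within the text below, say \"I don't know.\""
)
context = (
    "The men's high jump at the 2020 Summer Olympics was won jointly by "
    "Mutaz Essa Barshim and Gianmarco Tamberi."
)
question = "Who won the men's high jump at the 2020 Summer Olympics?"

prompt = f"{instruction}\n\nContext:\n{context}\n\nQ: {question}\nA:"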

In a previous article I wrote about the importance of context as a reference for generative LLMs, and how inaccurate LLM responses can be negated by it.

However, one of the challenges which emerged from the previous article was how a larger body of text (corpus) can be used as a contextual reference.

Hence, here I step through the process of getting relevant data from Wikipedia on a particular subject, transforming the data, creating embeddings, and lastly referencing the contextual data in a query.


Embeddings

The challenge is that as implementations of OpenAI grow in complexity, the user experience jumps very quickly from a no-code environment to a pro-code environment.

In order to search the larger corpus of Olympics-related information we have been using, text embeddings need to be created.

OpenAI’s text embeddings measure the semantic similarity between strings of text.

Embeddings are relevant whenever any kind of semantic similarity needs to be detected; for instance in search, classification, clustering, and so on.

In the words of OpenAI:

An embedding is a vector (list) of floating point numbers. The distance between two vectors measures their relatedness. Small distances suggest high relatedness and large distances suggest low relatedness.
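
As a sketch of how this relatedness can be measured in code, using the pre-v1 openai Python package this article is based on together with numpy (the helper names here are my own):

import numpy as np
import openai

openai.api_key = "YOUR_API_KEY"  # assumption: supply your own key

def get_embedding(text, model="text-embedding-ada-002"):
    # Return the embedding vector for a single string.
    result = openai.Embedding.create(model=model, input=text)
    return result["data"][0]["embedding"]

def vector_similarity(x, y):
    # OpenAI embeddings are normalised to length 1, so the dot product
    # doubles as cosine similarity: higher means more related.
    return np.dot(np.array(x), np.array(y))

print(vector_similarity(get_embedding("men's high jump winner"),
                        get_embedding("Barshim and Tamberi shared the gold medal")))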

The process of creating text embeddings includes data collection, data transformation, saving the data, creating the embeddings with OpenAI, and finally answering questions based on search results from the embeddings.


Read more about embeddings here:

Below is a snippet of the relevant code, which loads the pre-processed Wikipedia data:

import pandas as pd

# Load the pre-processed Olympics sections from the OpenAI example data.
df = pd.read_csv('https://cdn.openai.com/API/examples/data/olympics_sections_text.csv')
df = df.set_index(["title", "heading"])
print(f"{len(df)} rows in the data.")
df.sample(5)
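
To produce the embeddings file shown next, each (title, heading) section of the dataframe is embedded individually. A minimal sketch of that step, reusing the hypothetical get_embedding helper above and assuming the section text lives in the content column of the example data:

# Compute an embedding for every section of the Olympics corpus.
def compute_doc_embeddings(sections_df):
    return {
        idx: get_embedding(row.content)
        for idx, row in sections_df.iterrows()
    }

document_embeddings = compute_doc_embeddings(df)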

And here you can see the file format after the embeddings have been created using the model text-embedding-ada-002.

# Pre-computed embeddings for each section, one float per column.
datafile_path = '/olympics_sections_document_embeddings.csv'

df = pd.read_csv(datafile_path)
df.sample(5)
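
To use the stored embeddings for search, the vectors are reconstructed from the CSV and every section is ranked by its similarity to the query. A sketch, assuming the cookbook's file layout of a title and heading column followed by one column per embedding dimension:

# Rebuild {(title, heading): embedding} from the CSV, then rank sections
# by similarity to a question.
def load_embeddings(fname):
    emb_df = pd.read_csv(fname, header=0)
    max_dim = max(int(c) for c in emb_df.columns if c not in ("title", "heading"))
    return {
        (r.title, r.heading): [r[str(i)] for i in range(max_dim + 1)]
        for _, r in emb_df.iterrows()
    }

def order_by_similarity(query, contexts):
    query_embedding = get_embedding(query)
    return sorted(
        ((vector_similarity(query_embedding, emb), idx) for idx, emb in contexts.items()),
        reverse=True,
    )

document_embeddings = load_embeddings(datafile_path)
print(order_by_similarity("Who won the men's high jump?", document_embeddings)[:3])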

Preparing the data and code is an advanced process. Read more about the data aspect and view the notebook examples.

In the final result, a question is asked and answered using the most relevant sections as context.
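
A sketch of that last step, in the spirit of the cookbook notebook: the most relevant sections are stitched into the prompt as context, and a completion model answers the question. Here sections_df is a placeholder for the sections dataframe loaded earlier from olympics_sections_text.csv, and the prompt wording and model parameters are illustrative:

# Answer a question using only the most relevant sections as context.
def answer_with_context(question, sections_df, document_embeddings, n_sections=3):
    most_relevant = order_by_similarity(question, document_embeddings)[:n_sections]
    context = "\n* ".join(sections_df.loc[idx].content for _, idx in most_relevant)

    prompt = (
        "Answer the question as truthfully as possible using the provided context, "
        "and if the answer is not contained within the text below, say \"I don't know.\"\n\n"
        f"Context:\n* {context}\n\nQ: {question}\nA:"
    )
    response = openai.Completion.create(
        model="text-davinci-003",  # completion model available at the time of writing
        prompt=prompt,
        temperature=0,
        max_tokens=300,
    )
    return response["choices"][0]["text"].strip()

answer = answer_with_context(
    "Who won the men's high jump at the 2020 Summer Olympics?",
    sections_df,  # the dataframe loaded earlier from olympics_sections_text.csv
    document_embeddings,
)
print(answer)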

You will find the complete notebook here:

In Conclusion

For semantic search, OpenAI has deprecated their document search approach and replaced it with the embeddings model.

There has been much talk regarding harnessing LLMs and creating more predictable and reliable results. I have always maintained that the only solution is creating models tailored to each specific implementation.

From an OpenAI perspective, there is no dashboard or NLU/NLG design studio, and the data collection, transformation and ingestion require custom development.

However, to achieve this at scale, a no-code studio approach is required, where, with weak supervision, unstructured training data can be converted into NLU and NLG design data.


I explore and write about all things at the intersection of AI and language: Conversational AI, development frameworks, and more.

https://www.linkedin.com/in/cobusgreyling
