Speed Up OpenAI API Responses With Predicted Outputs
In this article I discuss how to leverage OpenAI’s Predicted Outputs for quicker API responses.
Introduction
Predicted Outputs allow you to significantly reduce latency in API responses when much of the output is already known.
Using OpenAI Predicted Outputs does introduce a dependency on the two supported models, and it means less freedom and tighter coupling with OpenAI as a model provider.
This feature is particularly useful for scenarios where you’re regenerating text or code files with minor changes, making the response faster by leveraging known tokens.
With OpenAI Predicted Outputs, the prediction text also provides context for the model.
It helps the model understand the tone, style, and content, guiding it to generate more coherent continuations. This dual role improves the relevance and accuracy of the final output.
Predictions do not save cost: any rejected tokens are still billed like other completion tokens generated by the API, so Predicted Outputs can introduce higher costs for your requests.
Some Background
By using a prediction, you can provide the tokens you already know upfront. The model then focuses on generating only the new or modified parts, streamlining the response time.
Currently, Predicted Outputs are supported with the latest gpt-4o and gpt-4o-mini models.
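As a minimal sketch of the mechanism (assuming the current OpenAI Python SDK, version 1.x), the known text is passed through the prediction parameter of a Chat Completions request:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Text we expect to appear largely unchanged in the output
known_text = "Our refund policy allows returns within 30 days of purchase."

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Change the refund window in this sentence to 60 days: " + known_text}],
    prediction={"type": "content", "content": known_text},
)
print(response.choices[0].message.content)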
Latency Savings: Predicted Outputs reduce the time it takes for the model to generate responses when much of the output is known in advance.
By pre-defining parts of the response, the model can quickly focus on generating only the unknown or modified sections, leading to faster response times.
Cost Implications: Although latency is reduced, the cost remains the same or can even increase.
This is because the API charges for all tokens processed, including the rejected prediction tokens — those that are generated but not included in the final output.
As a result, even if a prediction reduces the number of new tokens generated, you’re still billed for all tokens processed in the session, whether they are used in the final response or not.
While Predicted Outputs can improve response speed, they do not inherently reduce the cost of API usage.
Contextual Reference
With OpenAI Predicted Outputs, the prediction text can also provide further context to the model.
By including a prediction, you are not only guiding the model on what is already known but also offering additional context that the model can use to better understand and complete the task.
For example, if you provide a partial letter as a prediction, the model uses that partial text to understand the style, tone, and content, which helps it generate a more coherent and contextually relevant continuation.
This dual role of the prediction — both as a hint for what the output should be and as context — enhances the model’s ability to produce accurate and appropriate completions.
Limitations
When using Predicted Outputs, keep the following factors and limitations in mind:
Model Compatibility: Predicted Outputs are supported only with the GPT-4o and GPT-4o-mini series models.
Token Charges: Tokens generated but not included in the final output are still billed at completion token rates. More about this in the Cost section below.
Unsupported API Parameters:
n: values higher than 1 are not supported
logprobs: not supported
presence_penalty: values greater than 0 are not supported
frequency_penalty: values greater than 0 are not supported
audio: Predicted Outputs are not compatible with audio inputs and outputs
modalities: only text modalities are supported
max_completion_tokens: not supported
tools: function calling is not currently supported with Predicted Outputs
Cost
In OpenAI’s API, when the model generates a predicted output, it may produce more tokens than what is ultimately included in the final completion.
These extra tokens, referred to as rejected prediction tokens, are the tokens generated by the model but not selected as part of the final response that is returned to the user.
Despite these tokens not being part of the visible output, they are still counted and charged at the completion token rates.
This means that the cost of using the API includes both the tokens in the final output and any additional tokens the model considered but did not use.
The rejected_prediction_tokens property in the usage object provides a count of these unused tokens. This information helps users understand the total token usage and cost, including those tokens that were generated but excluded from the final completion.
In short, while you only see and use the final selected tokens, you’re billed for all tokens the model processed, including those it generated but discarded.
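As a short sketch of how to inspect these counts (assuming the current OpenAI Python SDK, where they are exposed on the response under usage.completion_tokens_details):

# Inspect prediction-related token usage on a Chat Completions response
usage = response.usage

print("Completion tokens billed:", usage.completion_tokens)
print("Accepted prediction tokens:", usage.completion_tokens_details.accepted_prediction_tokens)
print("Rejected prediction tokens:", usage.completion_tokens_details.rejected_prediction_tokens)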
Example Code
Below is the simplest of Python examples, which you can run in a notebook; the code will prompt you to enter your OpenAI API key.
pip install --upgrade openai

from openai import OpenAI

# Prompt the user for their OpenAI API key
api_key = input("Please enter your OpenAI API key: ")
client = OpenAI(api_key=api_key)

# Define a simple function to use Predicted Outputs
def use_predicted_outputs():
    # Example text where most of the output is known ahead of time
    prediction = "The quick brown fox jumps over the lazy dog"

    # Call the Chat Completions API, passing the known text via the prediction parameter
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # A model that supports Predicted Outputs
        messages=[
            {"role": "system", "content": "You are an assistant that completes sentences."},
            {"role": "user", "content": "Complete the following sentence: 'The quick brown fox'."}
        ],
        prediction={"type": "content", "content": prediction},
    )

    # Print the prediction and the final completion
    print("Predicted Output:", prediction)
    print("Final Response:", response.choices[0].message.content)

# Run the function
use_predicted_outputs()
And below is the output from running the notebook:
Predicted Output: The quick brown fox jumps over the lazy dog
Final Response: 'The quick brown fox jumps over the lazy dog.'
The example below shows how a partial letter can be supplied as the prediction and completed by the model…
# Define a function to use Predicted Outputs with a letter-writing example
def use_predicted_outputs():
    # Predefined part of the letter (the prediction)
    prediction = "Dear John,\n\nRegarding our conversation about the budget..."

    # Call the Chat Completions API, passing the partial letter as the prediction
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # A model that supports Predicted Outputs
        messages=[
            {"role": "system", "content": "You are an assistant that helps complete letters."},
            {"role": "user", "content": "Complete the following letter starting with: '" + prediction + "'"}
        ],
        prediction={"type": "content", "content": prediction},
    )

    # Print the prediction and the final completion, which should continue from the prediction
    print("Predicted Part of the Letter:\n", prediction)
    print("\nCompleted Letter:\n", response.choices[0].message.content)

# Run the function
use_predicted_outputs()
And the output from the Notebook…
Use-Cases
The prediction feature in OpenAI’s API is useful in scenarios where much of the output is already known or can be anticipated, allowing the model to focus on generating only the new or modified content.
The prediction feature is particularly useful in a number of scenarios.
One common use case is regenerating or refining documents, where small edits or updates are needed, such as correcting grammar, adding a paragraph, or adjusting formatting. For example, updating a legal contract or a technical document where most of the text remains unchanged benefits from this feature.
Another use case is auto-completion in IDEs. When writing code, certain portions can often be predicted based on the current context, allowing for the auto-completion of boilerplate code structures or repetitive coding patterns.
Or when templates need to be completed, where much of the content is static and only specific fields need to be generated dynamically, the prediction feature excels. For instance, generating personalised emails or reports with a fixed structure but varying details becomes more efficient.
Dialog turns from a conversation also benefit from predictions, especially in chatbot applications. The next part of a conversation can often be anticipated based on prior interactions, such as preemptively generating responses in a customer service chatbot.
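As a sketch of the document and code regeneration use case described above (the client object is the one created in the earlier examples, and the class snippet is purely illustrative), the existing file contents are sent as the prediction and the model is asked to make one small change:

# The current contents of the file; most of it should come back unchanged
existing_code = """class User:
    first_name: str
    last_name: str
    username: str
"""

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": "Replace the username property with an email property. Respond only with code."},
        {"role": "user", "content": existing_code}
    ],
    prediction={"type": "content", "content": existing_code},
)

print(response.choices[0].message.content)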
Production Scenarios
In a production implementation, managing the predicted text with OpenAI Predicted Outputs would involve several strategic steps to ensure efficient usage, coherence, and cost-effectiveness. Here’s how it could be managed:
1. Generating Predictions
Pre-processing: Before sending a request, the system generates or retrieves the predicted text based on known patterns, templates, or historical data.
Dynamic Prediction: For frequently updated content, the system can dynamically generate predictions based on the latest user input or system state.
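A small sketch of this step (the template and helper are illustrative only): a prediction can be assembled from a stored template plus the latest known values before the request is sent:

# Illustrative only: build the predicted text from a stored template and known fields
LETTER_TEMPLATE = "Dear {name},\n\nRegarding our conversation about {topic}..."

def build_prediction(name: str, topic: str) -> str:
    # The filled-in template becomes the prediction sent with the API request
    return LETTER_TEMPLATE.format(name=name, topic=topic)

prediction = build_prediction("John", "the budget")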
2. Incorporating Predictions in API Requests
API Integration: The predicted text is included in the prediction parameter when calling the OpenAI API. This helps pre-fill parts of the expected output, reducing latency.
Contextual Relevance: The system ensures that the prediction is highly relevant to the current prompt to maximise the coherence and utility of the generated output.
3. Evaluating Model Responses
Post-processing: After receiving the API response, the system evaluates how much of the prediction was used. This involves checking the rejected_prediction_tokens property to understand the model's acceptance or rejection of the prediction.
Adjustment Logic: If the prediction is often rejected or misaligned with the final output, the system can adjust future predictions to better match the model’s expectations.
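A sketch of this evaluation step (assuming the token counts exposed under usage.completion_tokens_details on the response):

# Measure how much of the prediction the model accepted
details = response.usage.completion_tokens_details
accepted = details.accepted_prediction_tokens
rejected = details.rejected_prediction_tokens
total = accepted + rejected

# If most of the prediction is rejected, the prediction source needs revisiting
if total and rejected / total > 0.5:
    print("Prediction mostly rejected; adjust the template or prediction logic.")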
4. Performance Monitoring
Latency Tracking: The system monitors the response times to ensure that using predicted outputs is indeed reducing latency.
Cost Analysis: Regular analysis of token usage helps manage costs, especially considering that rejected tokens are still billed.
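A sketch of this monitoring step (messages, prediction and client are assumed to be defined as in the earlier examples):

import time

# Time the request and record token usage for later cost analysis
start = time.perf_counter()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=messages,
    prediction={"type": "content", "content": prediction},
)
elapsed = time.perf_counter() - start

usage = response.usage
print(f"Latency: {elapsed:.2f}s, completion tokens: {usage.completion_tokens}, "
      f"rejected prediction tokens: {usage.completion_tokens_details.rejected_prediction_tokens}")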
5. Fallback Mechanisms
Error Handling: If the prediction fails or leads to incoherent outputs, the system can fall back to standard completions without predictions, ensuring reliability.
Iterative Refinement: The prediction logic can be refined over time based on user feedback and model performance to improve accuracy and relevance.
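A sketch of a simple fallback (the helper name is illustrative; client is the one created earlier):

# Fall back to a standard completion if the predicted request fails
def complete_with_fallback(messages, prediction=None):
    try:
        if prediction is not None:
            return client.chat.completions.create(
                model="gpt-4o-mini",
                messages=messages,
                prediction={"type": "content", "content": prediction},
            )
    except Exception as exc:
        print(f"Prediction request failed, retrying without prediction: {exc}")
    # Standard completion without a prediction
    return client.chat.completions.create(model="gpt-4o-mini", messages=messages)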
Chief Evangelist @ Kore.ai | I’m passionate about exploring the intersection of AI and language. From Language Models, AI Agents to Agentic Applications, Development Frameworks & Data-Centric Productivity Tools, I share insights and ideas on how these technologies are shaping the future.
https://platform.openai.com/docs/guides/predicted-outputs?lang=python