
OpenAI AI Agents SDK Guardrails

Apr 29, 2025

Well-designed guardrails assist in managing data privacy risks, for example by preventing system prompt leaks, and reputational risks, for example by enforcing brand-aligned model behaviour.

Guardrails can be established to address risks already identified for a specific use case, with additional ones layered in as new vulnerabilities are discovered.

Guardrails are a critical component of any LLM-based deployment but should be paired with robust authentication and authorisation protocols, strict access controls, and standard software security measures.

Guardrails function as a layered defence mechanism.

While a single guardrail is unlikely to provide sufficient protection, employing multiple, specialised guardrails together creates more resilient agents.

In OpenAI's layered approach, LLM-based guardrails, rules-based guardrails such as regex, and the OpenAI Moderation API are combined to vet user inputs, as sketched in the example below.
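
A minimal sketch of that layering is shown here: a cheap regex blocklist runs first, followed by a call to the OpenAI Moderation API (an LLM-based classifier could be added as a third check). It assumes the openai Python package is installed and OPENAI_API_KEY is set in the environment; the helper names and blocklist patterns are illustrative, not part of any SDK.

import re
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Example blocklist patterns for the rules-based layer (illustrative only)
BLOCKLIST = [r"\bsystem prompt\b", r"\bignore previous instructions\b"]

def regex_check(text: str) -> bool:
    """Rules-based layer: reject inputs matching any blocklist pattern."""
    return not any(re.search(p, text, re.IGNORECASE) for p in BLOCKLIST)

def moderation_check(text: str) -> bool:
    """Moderation layer: reject inputs flagged by the OpenAI Moderation API."""
    result = client.moderations.create(
        model="omni-moderation-latest",
        input=text,
    )
    return not result.results[0].flagged

def vet_input(text: str) -> bool:
    """Run the cheap regex check first, then the moderation call."""
    return regex_check(text) and moderation_check(text)

if __name__ == "__main__":
    print(vet_input("What is the weather in Cape Town?"))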

Below is a breakdown of how OpenAI advises guardrails should be built and maintained.

How To Run Your Own Prototype

Create a virtual environment called openai:

python3 -m venv openai

Then activate this virtual environment:

source openai/bin/activate

Then install the OpenAI Agents SDK:

pip install openai-agents

This example uses a basic keyword-based content filter and a predefined list of allowed actions, reflecting the principles outlined in OpenAI's Agents SDK documentation.

The script is designed to be run as a standalone file.

import re
import logging
from typing import List, Optional

# Configure logging for transparency
logging.basicConfig(
    filename='agent_actions.log',
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)

class AgentGuardrail:
    def __init__(self):
        # Define forbidden words for content filtering (example)
        self.forbidden_words = ['harm', 'violence', 'illegal', 'dangerous']
        # Define allowed actions (predefined action space)
        self.allowed_actions = ['send_message', 'fetch_data', 'schedule_event']
        # Initialize logger
        self.logger = logging.getLogger(__name__)

    def validate_input(self, user_input: str) -> bool:
        """
        Validate user input by checking for forbidden words.
        Returns True if input is safe, False otherwise.
        """
        input_lower = user_input.lower()
        for word in self.forbidden_words:
            if re.search(r'\b' + word + r'\b', input_lower):
                self.logger.warning(f"Input rejected: contains forbidden word '{word}'")
                return False
        self.logger.info("Input validated successfully")
        return True

    def restrict_action(self, action: str) -> bool:
        """
        Check if the requested action is within the allowed action space.
        Returns True if action is allowed, False otherwise.
        """
        if action in self.allowed_actions:
            self.logger.info(f"Action '{action}' is allowed")
            return True
        self.logger.warning(f"Action '{action}' is not in allowed action space")
        return False

    def execute_safe_action(self, user_input: str, action: str) -> str:
        """
        Process user input and execute action if it passes guardrails.
        Returns a response message.
        """
        # Step 1: Validate input
        if not self.validate_input(user_input):
            return "Error: Input contains inappropriate content."

        # Step 2: Restrict action
        if not self.restrict_action(action):
            return f"Error: Action '{action}' is not permitted."

        # Step 3: Simulate executing the action
        self.logger.info(f"Executing action '{action}' for input: {user_input}")
        return f"Success: Action '{action}' executed."

def main():
    # Initialize the guardrail system
    guardrail = AgentGuardrail()

    # Example test cases
    test_cases = [
        {"input": "Please send a message to the team", "action": "send_message"},
        {"input": "Schedule a meeting for tomorrow", "action": "schedule_event"},
        {"input": "Perform an illegal action", "action": "hack_system"},
        {"input": "Fetch some data", "action": "fetch_data"},
        {"input": "Cause harm to the system", "action": "send_message"},
    ]

    # Process each test case
    for case in test_cases:
        print(f"\nProcessing input: {case['input']}")
        print(f"Requested action: {case['action']}")
        result = guardrail.execute_safe_action(case['input'], case['action'])
        print(f"Result: {result}")

if __name__ == "__main__":
    main()

And the output…

 python3 agent_guardrails.py 

Processing input: Please send a message to the team
Requested action: send_message
Result: Success: Action 'send_message' executed.

Processing input: Schedule a meeting for tomorrow
Requested action: schedule_event
Result: Success: Action 'schedule_event' executed.

Processing input: Perform an illegal action
Requested action: hack_system
Result: Error: Input contains inappropriate content.

Processing input: Fetch some data
Requested action: fetch_data
Result: Success: Action 'fetch_data' executed.

Processing input: Cause harm to the system
Requested action: send_message
Result: Error: Input contains inappropriate content.

The Agents SDK elevates guardrails to core components, adopting an optimistic execution model by default.

In this model, the main agent generates outputs proactively while guardrails operate in parallel, raising exceptions if constraints are violated.

Guardrails can be implemented as functions or agents that enforce rules like preventing jailbreaks, validating relevance, filtering keywords, applying blocklists, or classifying safety.
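
As a minimal sketch of that pattern, the snippet below declares a keyword check as a guardrail function with the SDK's input_guardrail decorator, attaches it to an agent, and catches the tripwire exception raised when the check fails. It is based on the guardrail interface described in the SDK documentation; the keyword logic and agent configuration are illustrative, and running it requires an OPENAI_API_KEY.

import asyncio

from agents import (
    Agent,
    GuardrailFunctionOutput,
    InputGuardrailTripwireTriggered,
    RunContextWrapper,
    Runner,
    input_guardrail,
)

FORBIDDEN_WORDS = ["harm", "violence", "illegal", "dangerous"]  # example blocklist

@input_guardrail
async def keyword_guardrail(
    ctx: RunContextWrapper[None], agent: Agent, user_input
) -> GuardrailFunctionOutput:
    # Trip the guardrail if the raw input contains a forbidden keyword.
    text = user_input if isinstance(user_input, str) else str(user_input)
    flagged = any(word in text.lower() for word in FORBIDDEN_WORDS)
    return GuardrailFunctionOutput(output_info={"flagged": flagged}, tripwire_triggered=flagged)

agent = Agent(
    name="Support agent",
    instructions="Answer customer questions helpfully and safely.",
    input_guardrails=[keyword_guardrail],
)

async def main():
    try:
        result = await Runner.run(agent, "Please send a message to the team")
        print(result.final_output)
    except InputGuardrailTripwireTriggered:
        print("Guardrail tripped: input rejected before the agent responded.")

if __name__ == "__main__":
    asyncio.run(main())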


https://platform.openai.com/docs/guides/agents


Written by Cobus Greyling

I’m passionate about exploring the intersection of AI & language. www.cobusgreyling.com
