ChainForge Is A Prompt Engineering GUI Toolkit
ChainForge enables you to build evaluation logic to measure prompt performance, LLM drift, and model robustness.
I’m currently the Chief Evangelist @ HumanFirst. I explore & write about all things at the intersection of AI & language; ranging from LLMs, Chatbots, Voicebots, Development Frameworks, Data-Centric latent spaces & more.
ChainForge is a GUI for building evaluation logic to compare models, experiment with prompt templates, and perform generation auditing. ChainForge can be installed locally or run directly in a Chrome browser.
A number of additions have been made to ChainForge; these include (as seen below) chat turn nodes. Chat nodes are in step with OpenAI's move to deprecate completion and insertion modes and focus on chat modes.
With chat turn nodes, a conversation can be created while passing context from node to node. Hence a conversational UI built via prompt chaining can be simulated.
Multiple conversations can be run in parallel across different LLMs.
Chat messages can be templated, and the underlying LLM can be updated and changed along the way for each node.
Chat nodes are important for generation auditing of conversational interfaces. Each node can be inspected to detect prompt drift, LLM drift, and related issues.
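To make the chaining idea concrete, here is a minimal Python sketch of what a chain of chat turn nodes does: each turn appends to a shared message history so context carries forward to the next node. The `call_llm` function is a stub standing in for whichever chat-model backend a node is configured with; the function names and message format are illustrative, not ChainForge's internal API.

```python
def call_llm(messages):
    # Stub backend: a real node would send `messages` to the
    # configured chat model and return its reply text.
    return f"(reply given {len(messages)} messages of context)"

def chat_turn(history, user_message):
    """One chat turn node: append the user message, get a reply,
    and return the extended history for the next node."""
    history = history + [{"role": "user", "content": user_message}]
    reply = call_llm(history)
    return history + [{"role": "assistant", "content": reply}]

# Context accumulates from node to node, simulating a conversation.
history = [{"role": "system", "content": "You are a helpful assistant."}]
history = chat_turn(history, "Name a continent.")
history = chat_turn(history, "What is its largest country?")
```

Because each node receives and returns the full history, swapping the underlying LLM at any turn (as ChainForge allows) only changes `call_llm`, not the chaining logic.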
The image below shows how an expected or ground-truth response can be defined via tabular data input.
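A ground-truth check of this kind can be sketched as a small evaluation function. The `Response` class below is a stand-in for the response object an evaluator receives, with the model's output text and the tabular ground-truth value; the field names `.text` and `.var` are assumptions for illustration.

```python
class Response:
    """Stand-in for an evaluated LLM response: the output text plus
    the input variables, including the expected (ground-truth) answer."""
    def __init__(self, text, var):
        self.text = text
        self.var = var

def evaluate(response):
    # True when the model's answer matches the ground-truth column,
    # ignoring case and surrounding whitespace.
    return response.text.strip().lower() == response.var["expected"].strip().lower()

print(evaluate(Response("Paris", {"expected": "paris"})))  # → True
```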
Below, a text fields node is defined with the seven continents. A prompt node is templated on the text fields, followed by chat turn nodes. In the chat turn nodes, the previously used LLMs can be reused, or a new LLM can be defined.
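The text-fields-to-prompt step amounts to simple template substitution: each field value fills the template's slot, producing one prompt per field. A minimal sketch (the variable name `continent` is illustrative):

```python
continents = ["Africa", "Antarctica", "Asia", "Australia",
              "Europe", "North America", "South America"]

# One prompt is generated per text field value.
template = "What is the largest country in {continent}?"
prompts = [template.format(continent=c) for c in continents]

print(prompts[0])  # → What is the largest country in Africa?
```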
For each chat turn node, an inspect node can be defined to view the LLM responses.
The response selector has options for grouped lists or tables; below you see the output for each of the models referenced.
Consider the LLM Scorer below. The LLM Scorer uses a single model to score other LLMs' responses, by means of a scoring prompt in which you define how the LLM must perform the scoring.
In this case, the LLM Scorer prompt is:
Respond with ‘true’ if the text is positive, and respond with ‘false’ if the text is negative.
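The LLM-as-judge pattern behind the scorer can be sketched as follows: a single judge model receives the scoring prompt plus each response text and returns a verdict. The `judge` function here is a stub standing in for a real chat-model call; a production setup would send both strings to the scoring LLM.

```python
SCORING_PROMPT = ("Respond with 'true' if the text is positive, "
                  "and respond with 'false' if the text is negative.")

def judge(scoring_prompt, text):
    # Stub judge: crude keyword sentiment in place of a real LLM call,
    # purely so the scoring flow is runnable.
    return "true" if "great" in text.lower() else "false"

def score(responses):
    """Score each LLM response with the single judge model."""
    return {text: judge(SCORING_PROMPT, text) for text in responses}

print(score(["This product is great!",
             "This was a terrible experience."]))
```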
⭐️ Follow me on LinkedIn for updates on Large Language Models ⭐️