Language Model Cascading & Probabilistic Programming Language

The term Language Model Cascading (LMC) was coined in July 2022, which seems like a lifetime ago considering the speed at which the LLM narrative arc develops…

7 min read · Sep 14, 2023

I’m currently the Chief Evangelist @ HumanFirst. I explore & write about all things at the intersection of AI & language; ranging from LLMs, Chatbots, Voicebots, Development Frameworks, Data-Centric latent spaces & more.

Introduction

This is one of the most interesting papers I have read in a long time. The fact that it is just over a year old makes it seem quite recent; however, reading the paper, one realises how far the technology has progressed in just over twelve months.

Considering the paper, Scratchpad and Chain-of-Thought are two of the most recognised approaches in recent times. The Tool Use description is very close to what we know today as autonomous agents. Selection-Inference is a basic description of RAG. And Verifiers is really reminiscent of the recent test framework developed by Ragas.

In this study of July 2022, cascading is used as a synonym for chaining. In later studies cascading has adopted a different meaning.

It is interesting to compare the descriptive terms used in this paper with the now well-known terms, and insightful to see how these ideas developed into real implementations.

It’s also interesting how the early vulnerabilities and opportunities were identified, and how developments on a few fronts brought solutions to production.

Developments took place in the area of LLMs, Prompt Engineering Techniques, Prompt Injection/enrichment at inference, Autonomous Agents and Prompt Chaining IDEs.

Back To The Paper…

The term LMC was developed to act as a reference framework for computer programs that chain together LLM interactions, described in terms of a probabilistic programming language (PPL).

Even though this study is old in relative terms, valuable principles can be gleaned from it, and it acts as a history lesson in how we find ourselves with the current tools at our disposal.

Scratchpads & CoT

I like the way this paper words CoT and Scratchpads: inference can be implemented by ancestral sampling. Both Scratchpads and CoT show intermediate computation.
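To make ancestral sampling concrete, here is a minimal sketch of my own, not code from the paper. It assumes a chain in which a thought T is sampled given the question Q, and an answer A is then sampled given that thought. The lookup tables and the helper names (sample_from, ancestral_sample) are invented stand-ins for a language model.

```python
import random

# Toy stand-ins for a language model (invented for illustration):
# each table maps a string to candidate continuations with probabilities.
THOUGHTS = {
    "What is 2 + 3?": [("2 plus 3 equals 5.", 0.8), ("2 plus 3 equals 6.", 0.2)],
}
ANSWERS = {
    "2 plus 3 equals 5.": [("5", 1.0)],
    "2 plus 3 equals 6.": [("6", 1.0)],
}


def sample_from(options, rng):
    """Draw one string from a list of (string, probability) pairs."""
    strings, weights = zip(*options)
    return rng.choices(strings, weights=weights, k=1)[0]


def ancestral_sample(question, rng):
    """Sample parents before children: first T given Q, then A given T."""
    thought = sample_from(THOUGHTS[question], rng)
    answer = sample_from(ANSWERS[thought], rng)
    return thought, answer


rng = random.Random(0)
samples = [ancestral_sample("What is 2 + 3?", rng) for _ in range(5)]
for thought, answer in samples:
    print(f"T: {thought}  A: {answer}")
```

The intermediate thought is the scratchpad: it is sampled and kept, and the answer is only ever drawn conditioned on it.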

Tool Use

Tool use is strongly reminiscent of what we now know as autonomous agents.

The other applications discussed so far involve iterating a language model within a controlled flow, without external feedback. The paper argues that there are many tasks of interest in which a model is interacting with external systems.

Examples of external tools are a calculator to solve math problems, or a tool which can browse the web and perform QnA.
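A rough sketch of such a loop with a calculator tool is shown below. Everything in it is an assumption made for illustration: fake_model is a scripted stand-in for an LLM, and the CALC(...) call format and the "Tool result:" line are conventions I made up, not something the paper prescribes.

```python
import ast
import operator
import re

# Arithmetic operators the toy calculator accepts.
_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
        ast.Mult: operator.mul, ast.Div: operator.truediv}


def calculator(expression):
    """Evaluate a basic arithmetic expression without using eval()."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError(f"unsupported expression: {expression!r}")
    return walk(ast.parse(expression, mode="eval").body)


def fake_model(prompt):
    """Hypothetical LLM stand-in: requests the tool first, then answers."""
    match = re.search(r"Tool result: (\S+)", prompt)
    if match:
        return f"Answer: {match.group(1)}"
    return "CALC(12 * 7)"


def run_with_tool(question, max_steps=3):
    """Alternate between the model and the external tool until an answer appears."""
    prompt = f"Question: {question}"
    output = ""
    for _ in range(max_steps):
        output = fake_model(prompt)
        call = re.fullmatch(r"CALC\((.+)\)", output)
        if call is None:
            return output  # the model produced a final answer
        # Feed the tool's result back into the prompt for the next step.
        prompt += f"\nTool result: {calculator(call.group(1))}"
    return output


print(run_with_tool("What is 12 times 7?"))
```

The point of the sketch is the feedback path: the controller, not the model, executes the tool, and the result re-enters the chain as plain text.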

Verifiers

An intuitive way to improve model performance is to train it to judge whether an answer and thought are likely to be “valid”. Cobbe et al. (2021) propose using a separate model as a verifier to filter solutions to reasoning tasks.

The Verifiers approach adds a verification label V to each example, representing whether the thought T is a valid form of reasoning for deriving A from Q, and whether A is the correct answer. This gives a “labeled” training set of the form:

D = {(Q, T, A, V)}

The verifiers may be used to reject incorrect examples in ancestral sampling, and the thought generator may itself be conditioned on the verifiers being correct by fine-tuning or prompting.

This approach is quite reminiscent of the Ragas approach of having a ground-truth reference.

Practically it will work best if only a sample of data is submitted to the Verifiers process for testing.
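As a rough illustration of using a verifier to reject incorrect samples, consider the sketch below. It is my own toy setup, not the paper's: generate is a hypothetical noisy generator, and verifier is a hand-written oracle standing in for what would in practice be a trained model.

```python
import random


def generate(question, rng):
    """Hypothetical generator: returns (thought, answer), wrong about a quarter of the time."""
    a, b = question
    answer = a + b + rng.choice([0, 0, 0, 1])
    return f"{a} + {b} = {answer}", answer


def verifier(question, thought, answer):
    """Hypothetical verifier V: True only if both the thought and the answer check out."""
    a, b = question
    return thought == f"{a} + {b} = {a + b}" and answer == a + b


def sample_verified(question, rng, max_tries=20):
    """Ancestral sampling with rejection: keep drawing until V accepts a sample."""
    for _ in range(max_tries):
        thought, answer = generate(question, rng)
        if verifier(question, thought, answer):
            return thought, answer
    return None  # every candidate was rejected


rng = random.Random(0)
print(sample_verified((2, 3), rng))
```

Rejection is the simplest option; the paper also mentions conditioning the thought generator on the verifier through fine-tuning or prompting, which this sketch does not attempt.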

Selection-Inference

In this paper Selection-Inference is considered as a chain. Considering the image below, S is the selected subset of facts and I is an inference driven by this subset.

This approach is very reminiscent of retrieval-augmented generation (RAG) as we know and understand it today.
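The two steps can be sketched in a few lines. This is only a toy under stated assumptions: selection is done by simple word overlap rather than by a model, and toy_llm is a made-up stand-in that returns the first word of the context instead of calling a real LLM.

```python
import re

FACTS = [
    "Venus has a strong greenhouse effect.",
    "Brazil is the largest coffee producer.",
    "Venus is the hottest planet in the solar system.",
]


def tokens(text):
    """Lowercased set of words in a string."""
    return set(re.findall(r"[a-z]+", text.lower()))


def select(question, facts, k=1):
    """Selection step S: keep the k facts sharing the most words with the question."""
    ranked = sorted(facts, key=lambda f: len(tokens(f) & tokens(question)), reverse=True)
    return ranked[:k]


def infer(question, selected, llm):
    """Inference step I: prompt the model with only the selected facts."""
    context = " ".join(selected)
    return llm(f"Context: {context}\nQuestion: {question}\nAnswer:")


def toy_llm(prompt):
    """Hypothetical stand-in for an LLM: echoes the first word of the context."""
    return prompt.split("Context: ")[1].split()[0]


question = "Which planet is the hottest in the solar system?"
selected = select(question, FACTS)
answer = infer(question, selected, toy_llm)
print(selected, answer)
```

The inference step never sees the unselected facts, which is the same separation RAG makes between retrieval and generation.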

Probabilistic Programming Language (PPL)

Probabilistic programming languages (PPLs) are built on two constructs:

The ability to draw values at random from distributions.
The ability to condition values of variables in a program via observations.

What makes probabilistic programs different from traditional programs is that they can sample from distributions and condition variables on observed data.

Hence we can make predictions based on certain inputs and/or outputs of a program. For example, we can sample prompts conditioned on the output of a verifier or external tool.
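The two constructs can be shown without any language model at all. The sketch below is a generic textbook-style example of my own, not taken from the paper: it samples two coin flips, observes that at least one is heads, and estimates the conditional probability by discarding runs that contradict the observation.

```python
import random


def model(rng):
    """Sample: draw two fair coin flips at random."""
    first = rng.random() < 0.5
    second = rng.random() < 0.5
    return first, second


def posterior_both_heads(num_samples, rng):
    """Observe: keep only runs with at least one head, then estimate P(both heads)."""
    kept = [flips for flips in (model(rng) for _ in range(num_samples)) if any(flips)]
    return sum(all(flips) for flips in kept) / len(kept)


rng = random.Random(0)
estimate = posterior_both_heads(20_000, rng)
# The exact conditional probability is 1/3; the estimate should be close to it.
print(f"P(both heads | at least one head) ~ {estimate:.3f}")
```

Replace the coin flips with strings sampled from a language model and the observation with a verifier or tool output, and you have the kind of conditioning the paper has in mind.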

Chaining was thought of as unique in providing a probabilistic programming framework over the space of strings.

Language models take in and emit text written in a human language. Chaining allows for various kinds of conditional and unconditional inference over this space.

Below is an example of Google Research’s cascades implementation:


!pip install cascades
!pip install duckduckgo-search  # or your preferred web query api

import os
import openai
api_key = None # @param
if api_key:
  os.environ['OPENAI_API_KEY'] = api_key

import cascades as cc


# Check that we can sample from GPT.
dist = cc.GPT(prompt='Probabilistic programming is ',
       # engine='davinci-codex', 
       temperature=0.7, 
       stop=('\n',))
x = dist.sample(rng=0)
x

from duckduckgo_search import ddg


keywords = 'How many legs does a rabbit have?'
results = ddg(keywords, region='wt-wt', safesearch='Moderate', time='y', max_results=3)
print(results)

results[0].keys(), results[0]['body']

import functools

@functools.lru_cache(maxsize=1000)
def get_passages(query, num_passages=5, output=None):
  # output: json, csv, print
  res = ddg(keywords=query, max_results=num_passages, output=output) 
  return res


@cc.model
def qa_with_search(question):
  """Answer question."""
  context = get_passages(question, num_passages=1)[0]['body']
  yield cc.log(context, name='context')
  prompt = f"""The answer sheet for the questions is below:

Question: Which planet is the hottest in the solar system?
Context: It has a strong greenhouse effect, similar to the one we experience on Earth. Because of this, Venus is the hottest planet in the solar system. The surface of Venus is approximately 465°C! Fourth from the Sun, after Earth, is Mars.
Answer: Venus

Question: Which country produces the most coffee in the world?
Context: With the rise in popularity of coffee among Europeans, Brazil became the world's largest producer in the 1840s and has been ever since. Some 300,000 coffee farms are spread over the Brazilian landscape.
Answer: Brazil

Question: {question}
Context: {context}
Answer:"""
  answer = yield cc.GPT(prompt=prompt, stop='\n', name='answer')
  return answer.value

@cc.model
def qa(question):
  """Answer question."""
  prompt = f"""Answer the questions below given a document from the web:

Question: What is often seen as the smallest unit of memory?
Answer: kilobyte

Question: Which planet is the hottest in the solar system?
Answer: Venus

Question: Which country produces the most coffee in the world?
Answer: Brazil

Question: {question}
Answer:"""
  answer = yield cc.GPT(prompt=prompt, stop='\n', name='answer')
  return answer.value

%time no_search = qa.sample('Which bones are babies born without?')
no_search

%time with_search = qa_with_search.sample('Which bones are babies born without?')
with_search

def compare(question):
  no_search = qa.sample(question)
  search = qa_with_search.sample(question)
  return no_search, search

compare('Which bone are babies born without')

from concurrent import futures
pool = futures.ThreadPoolExecutor(16)


Q = 'Which bone is a baby born without?'
rs = qa_with_search.sample_parallel(pool, Q, n=4)
rs  # List of running traces.

# wait up to 20 seconds for the first sample to finish
rs[0].future.result(20)

[r.return_value for r in rs]

%%time
rs = qa.sample_parallel(pool, Q, n=4)
[r.future.result(20) for r in rs]
print([r.return_value for r in rs])

Source


Language Model Cascades

The Cascades paper is available on arXiv. Prompted models have demonstrated impressive few-shot learning abilities…

model-cascades.github.io

Language Model Cascades

Prompted models have demonstrated impressive few-shot learning abilities. Repeated interactions at test-time with a…

arxiv.org

Show Your Work: Scratchpads for Intermediate Computation with Language Models

Large pre-trained language models perform remarkably well on tasks that can be done "in one pass", such as generating…

arxiv.org


Written by Cobus Greyling