Haystack Developed a Reranker Component To Solve LLM Long Context Vulnerability

A recent study found that when LLMs are presented with longer input, performance is best when the relevant content sits at the start or end of the input context, and degrades considerably when it sits in the middle. With Haystack's new ranker component, remedial action can now be taken within the document pipeline.

Cobus Greyling
5 min read · Sep 8

I’m currently the Chief Evangelist @ HumanFirst. I explore & write about all things at the intersection of AI & language; ranging from LLMs, Chatbots, Voicebots, Development Frameworks, Data-Centric latent spaces & more.

I recently wrote about how LLMs use large context windows and how to manage the performance and cost of large context input to LLMs.

Relying on large context windows leads to a decline in LLM performance.

If the complexity is offloaded to the LLM provider, the application becomes a black-box without control over cost, input/output token usage, model performance, and context.

Taking a simplistic approach also creates technical debt that must be addressed later in the application lifecycle. Offloading complexity and data management to the LLM ties the Generative App closely to a specific LLM as well.

To be LLM agnostic, Generative Apps can follow a RAG approach. The ideal situation is where the LLM acts as a utility and does not manage data or application complexity. With a RAG implementation, use-cases requiring large context windows can be handled outside the scope of the LLM.

I also emphasised a recent study which found that LLMs perform better when the relevant information is located at the beginning or end of the input context.

However, when the relevant content sits in the middle of a longer input, retrieval performance degrades considerably. This holds even for models specifically designed for long contexts.

A few days ago Haystack released a component which optimises the layout of selected documents in the LLM context window. The component is a way to work around the problem identified in the paper.

LostInTheMiddleRanker alternates the placement of the best documents between the beginning and end of the context window, making it easier for the LLM’s attention mechanism to access and use them.

Here is a good explanation:

To understand how LostInTheMiddleRanker orders the given documents, imagine a simple example where documents consist of a single digit from 1 to 10 in ascending order. LostInTheMiddleRanker will order these ten documents in the following order: [1 3 5 7 9 10 8 6 4 2].

Source
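The reordering itself is simple to reproduce. Below is a minimal, stdlib-only sketch of the interleaving (my own illustration, not Haystack's actual source): documents arrive ranked best-first; odd ranks fill the front of the window and even ranks fill the back in reverse, so the weakest documents end up in the middle.

```python
def lost_in_the_middle_order(ranked_docs):
    """Reorder best-first ranked docs so top results sit at both ends.

    Ranks 1, 3, 5, ... fill the front; ranks 2, 4, 6, ... fill the
    back in reverse, leaving the weakest documents in the middle.
    """
    front, back = [], []
    for i, doc in enumerate(ranked_docs):
        (front if i % 2 == 0 else back).append(doc)
    return front + back[::-1]

print(lost_in_the_middle_order(list(range(1, 11))))
# [1, 3, 5, 7, 9, 10, 8, 6, 4, 2]
```

Running it on the ten single-digit documents reproduces exactly the ordering quoted above.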

Below is a complete working example of a retriever / generator pipeline with the LostInTheMiddleRanker included.

%%bash

pip install --upgrade pip
pip install farm-haystack[colab,elasticsearch]
pip install "datasets>=2.6.1"

from haystack.telemetry import tutorial_running
tutorial_running(22)

from haystack.document_stores import InMemoryDocumentStore
document_store = InMemoryDocumentStore(use_bm25=True)

from datasets import load_dataset
dataset = load_dataset("bilgeyucel/seven-wonders", split="train")
document_store.write_documents(dataset)


from haystack.nodes import BM25Retriever, SentenceTransformersRanker
from haystack import Pipeline

# Sparse retriever over the in-memory document store
retriever = BM25Retriever(document_store=document_store, top_k=2)

# Optional: a cross-encoder reranker that re-scores the retrieved documents
ranker = SentenceTransformersRanker(model_name_or_path="cross-encoder/ms-marco-MiniLM-L-12-v2")
p = Pipeline()
p.add_node(component=retriever, name="BM25Retriever", inputs=["Query"])
p.add_node(component=ranker, name="Ranker", inputs=["BM25Retriever"])


from haystack.nodes.ranker import LostInTheMiddleRanker

ranker = LostInTheMiddleRanker(
    word_count_threshold=1024,
    top_k=3,
)
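The word_count_threshold parameter caps how much text the ranker packs into the context: documents are taken in rank order until the word budget is reached, including the document that crosses it. A rough stdlib sketch of that selection logic (the function name is mine, not Haystack's):

```python
def select_within_budget(doc_texts, word_count_threshold=1024):
    # Add ranked documents until the word budget is reached; the
    # document that crosses the threshold is still included.
    selected, words = [], 0
    for text in doc_texts:
        selected.append(text)
        words += len(text.split())
        if words >= word_count_threshold:
            break
    return selected

print(select_within_budget(["one two three", "four five", "six"], word_count_threshold=4))
# ['one two three', 'four five']
```

With a threshold of 1024, as in the example above, the ranker keeps the context comfortably inside a typical model's window regardless of how many documents the retriever returns.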


from haystack.nodes import PromptNode, PromptTemplate, AnswerParser

rag_prompt = PromptTemplate(
    prompt="""Synthesize a comprehensive answer from the following text for the given question.
Provide a clear and concise response that summarizes the key points and information presented in the text.
Your answer should be in your own words and be no longer than 50 words.
\n\n Related text: {join(documents)} \n\n Question: {query} \n\n Answer:""",
    output_parser=AnswerParser(),
)

prompt_node = PromptNode(model_name_or_path="google/flan-t5-large", default_prompt_template=rag_prompt)

from haystack.pipelines import Pipeline

# Final RAG pipeline: retriever -> LostInTheMiddleRanker -> PromptNode
pipe = Pipeline()
pipe.add_node(component=retriever, name="retriever", inputs=["Query"])
pipe.add_node(component=ranker, name="ranker", inputs=["retriever"])
pipe.add_node(component=prompt_node, name="prompt_node", inputs=["ranker"])

output = pipe.run(query="What does Rhodes Statue look like?")
print(output["answers"][0].answer)

What I particularly like about this implementation from Haystack is that it is a good example of how innovation in the pre-LLM functionality, the pipeline phase, can remedy inherent vulnerabilities of an LLM.
