LLMs, Symbolic Reasoning & Visual Capabilities
Something that fascinated me with the advent of Large Language Models was their ability to perform symbolic reasoning…
A classic capability, even of early LLMs, was parsing a detailed textual description of a scene and answering logical questions about it.
For instance, a scene can be described and queried like this: if the blue box is between the chair and the lamp, and the lamp is to the right of the window, where is the window relative to the box?
Or describing a kitchen in detail, saying there are two pears on the counter, five bananas in the fridge, and so on.
And then asking the LLM how many pieces of fruit are in the kitchen in total.
In the past, symbolic reasoning did not receive the attention it deserved.
In text-only models, symbolic reasoning relies on the model’s internalised world model from training data — essentially simulating spatial relationships through linguistic patterns and logical inference.
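To make that style of inference concrete, here is a toy Python sketch that chains "left of" relations explicitly. It is an illustration of symbolic chaining only, not a claim about how the model reasons internally, and the facts encoded are one possible reading of the box/lamp/window puzzle above.

```python
# Toy transitive closure over "left of" facts on a single axis.
# Illustrates symbolic chaining; NOT how an LLM works internally.

def leftward_closure(facts):
    """facts: set of (a, b) pairs meaning 'a is left of b'."""
    closed = set(facts)
    changed = True
    while changed:
        changed = False
        for a, b in list(closed):
            for c, d in list(closed):
                if b == c and (a, d) not in closed:
                    closed.add((a, d))
                    changed = True
    return closed

# One reading of the puzzle: window left of lamp;
# box between chair and lamp, with chair on the left.
facts = {("window", "lamp"), ("chair", "box"), ("box", "lamp")}
closed = leftward_closure(facts)

print(("chair", "lamp") in closed)  # True, by transitivity
```

Notably, the original puzzle is underdetermined: knowing the window is left of the lamp does not pin down its position relative to the box, which is exactly the kind of subtlety such questions probe.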
Multimodal models extend this by using vision encoders to convert images into token embeddings that feed into the same language backbone.
This allows the model to reason in the language space about visual content, much like it did with descriptive text.
The result is that, instead of describing a scene in text, you can upload an image and pose the same kinds of logical, spatial questions, with the model extracting entities, relations and inferences.
A good example of this kind of reasoning comes from NVIDIA’s Nemotron Nano 2 VL model: a number of invoices are sent to the model, and questions can then be asked like “Sum up all the totals across the receipts” or “Here are 4 invoices flagged as potential duplicates — are they actually the same document with minor layout differences?”
You can find more details on this in the article below…
Claude Opus 4.1 supports the same approach: you can upload multiple documents, such as invoices, in a single chat or prompt.
This allows you to ask comparative questions such as “Compare the totals and line items across these five invoices — what discrepancies do you see?”
This leverages the model’s multimodal and file-processing capabilities: it extracts text from PDFs, performs OCR on images where needed, and reasons across the documents for tasks like variance detection or summarisation.
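The shape of such a multi-document request can be sketched as plain content blocks. The file bytes below are synthetic placeholders, and the block layout follows the Anthropic Messages API's base64 image source format (a sketch of the payload only, not a full client call):

```python
import base64

def image_block(media_type, raw_bytes):
    """Wrap raw image bytes as a base64 image content block (Anthropic-style)."""
    return {
        "type": "image",
        "source": {
            "type": "base64",
            "media_type": media_type,
            "data": base64.b64encode(raw_bytes).decode("utf-8"),
        },
    }

# Synthetic stand-ins for five scanned invoices (placeholder bytes).
invoices = [("image/png", b"fake-invoice-bytes-%d" % i) for i in range(5)]

content = [image_block(mt, data) for mt, data in invoices]
content.append({
    "type": "text",
    "text": "Compare the totals and line items across these five invoices. "
            "What discrepancies do you see?",
})

message_list = [{"role": "user", "content": content}]
print(len(message_list[0]["content"]))  # 6 blocks: five images plus the question
```

The images all go in one user turn, with the question last, so the model sees every document before the instruction.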
I recently heard of an implementation where this approach was used to help forensic auditors investigate possible breaches.
It is my understanding that for Claude Opus 4.1 the number of documents is capped at 20 files per chat, with each file limited to 30 MB, regardless of format (PDF, XLSX, TXT, etc.).
On the API side, image-specific limits allow up to 100 images per request (5 MB each), but for general file uploads like PDFs the effective cap aligns closer to the 20-file chat limit, to manage context length (up to 200K input tokens in total).
Apparently these caps ensure reliable performance without overwhelming the model’s 200K-token window, and you can reference files by name in prompts for targeted queries.
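Given those caps, a simple pre-flight check before batching uploads can save a failed request. The limits below are hard-coded from the figures above, so verify them against Anthropic's current documentation before relying on them:

```python
# Pre-flight check against assumed limits: 20 files per chat, 30 MB each.
# These figures are assumptions taken from the article; confirm before use.
MAX_FILES = 20
MAX_BYTES = 30 * 1024 * 1024

def validate_batch(file_sizes):
    """file_sizes: list of (name, size_in_bytes). Returns a list of problems."""
    problems = []
    if len(file_sizes) > MAX_FILES:
        problems.append(f"too many files: {len(file_sizes)} > {MAX_FILES}")
    for name, size in file_sizes:
        if size > MAX_BYTES:
            problems.append(f"{name} exceeds 30 MB ({size} bytes)")
    return problems

print(validate_batch([("a.pdf", 10_000), ("b.pdf", 40 * 1024 * 1024)]))
```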
Below I walk you through a practical example: creating a Python-based image-processing application in a Colab notebook.
All the images you will need are in this article. If you go to the Files option on the left of the Colab window, you can upload the image files.
Do the install…
%pip install anthropic IPython
This code is the behind-the-scenes prep for building the application that sends images to Claude for questions, like comparing invoices.
import base64
from anthropic import Anthropic
from google.colab import userdata
# Get the API key from Colab secrets
ANTHROPIC_API_KEY = userdata.get('ANTHROPIC_API_KEY')
client = Anthropic(api_key=ANTHROPIC_API_KEY)
MODEL_NAME = "claude-opus-4-1"
def get_base64_encoded_image(image_path):
    with open(image_path, "rb") as image_file:
        binary_data = image_file.read()
    base_64_encoded_data = base64.b64encode(binary_data)
    base64_string = base_64_encoded_data.decode("utf-8")
    return base64_string
The image is uploaded, as shown below…
from IPython.display import Image
Image(filename="/content/stack_overflow.png")
The image…
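In the examples that follow, the media type is hard-coded per file. As a small convenience, and this helper is my addition rather than part of the original notebook, the standard library can guess it from the file extension:

```python
import mimetypes

def guess_media_type(image_path):
    """Guess an image media type (e.g. image/png) from the file extension."""
    media_type, _ = mimetypes.guess_type(image_path)
    if media_type is None or not media_type.startswith("image/"):
        raise ValueError(f"not a recognised image type: {image_path}")
    return media_type

print(guess_media_type("/content/stack_overflow.png"))  # image/png
```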
The question is sent to the LLM…
message_list = [
{
"role": "user",
"content": [
{
"type": "image",
"source": {
"type": "base64",
"media_type": "image/png",
"data": get_base64_encoded_image("/content/stack_overflow.png"),
},
},
{"type": "text", "text": "Transcribe the code in the answer. Only output the code."},
],
}
]
response = client.messages.create(model=MODEL_NAME, max_tokens=2048, messages=message_list)
print(response.content[0].text)
And the response…
```python
import os
import base64
image = 'test.jpg'
encoded_string = ""
with open(image, "rb") as image_file:
encoded_string = base64.b64encode(image_file.read())
file = encoded_string
```
The example below is that of study notes…
Image(filename="/content/school_notes.png")
The actual image…
Asking the question and referencing the image…
message_list = [
{
"role": "user",
"content": [
{
"type": "image",
"source": {
"type": "base64",
"media_type": "image/png",
"data": get_base64_encoded_image("/content/school_notes.png"),
},
},
{
"type": "text",
"text": "Transcribe this text. Only output the text and nothing else.",
},
],
}
]
response = client.messages.create(model=MODEL_NAME, max_tokens=2048, messages=message_list)
print(response.content[0].text)
And the response…
U6L4 Levels of Cellular Organization
1) Cells group together to make tissue.
2) Tissues group together to make an organ.
3) Organs group together to make an organ system.
4) Organ Systems group together to make an organism.
Organism → a living thing that can carry out life processes by itself.
- Multicellular organisms have specialized cells to perform specific functions.
↳ This makes them more efficient and typically have a longer life span.
Tissue → a group of similar cells that perform a common function.
1) Animals are made of four basic types of tissue.
↳ nervous, epithelial, connective, and muscle
2) Plants have three types of tissue
↳ transport, protective, and ground
A more complex image, with printed and handwritten entries…
Image(filename="/content/vehicle_form.jpg")
The image…
The inference query…
message_list = [
{
"role": "user",
"content": [
{
"type": "image",
"source": {
"type": "base64",
"media_type": "image/jpeg",
"data": get_base64_encoded_image("/content/vehicle_form.jpg"),
},
},
{"type": "text", "text": "Transcribe this form exactly."},
],
}
]
response = client.messages.create(model=MODEL_NAME, max_tokens=2048, messages=message_list)
print(response.content[0].text)
And the transcription…
VEHICLE INCIDENT REPORT FORM
Use this form to report accidents, injuries, medical situations, criminal activities, traffic incidents, or student behavior incidents. If possible, a report should be completed within 24 hours of the event.
Date of Report: 02/29, 2024
PERSON INVOLVED
Full Name: John Doe Address: 123 Main St
Identification: ☒ Driver's License No. 4741921 ☐ Passport No. _____________
☐ Other: ___________________________
Phone: (678)999-8212 E-Mail: john@gmail.com
THE INCIDENT
Date of Incident: 02/29/2024 Time: 9:01 ☒ AM ☐ PM
Location: Corner of 2nd and 3rd
Describe the Incident: Red car t-boned blue car
_________________________________________________________________________
_________________________________________________________________________
INJURIES
Was anyone injured? ☐ Yes ☒ No
If yes, describe the injuries: ________________________________________________
_________________________________________________________________________
_________________________________________________________________________
WITNESSES
Were there witnesses to the incident? ☐ Yes ☒ No
If yes, enter the witnesses' names and contact info: ____________________________
_________________________________________________________________________
_________________________________________________________________________
Page 1 of 2
A more complex image…
Image(filename="/content/page.jpg")
The query…
message_list = [
{
"role": "user",
"content": [
{
"type": "image",
"source": {
"type": "base64",
"media_type": "image/jpeg",
"data": get_base64_encoded_image("/content/page.jpg"),
},
},
{"type": "text", "text": "Which is the most critical issue for live rep support?"},
],
}
]
response = client.messages.create(model=MODEL_NAME, max_tokens=2048, messages=message_list)
print(response.content[0].text)
And the response from the model…
According to the hierarchy pyramid shown in the document, the most critical issue for live rep support is **Product Quality/Liability Issues**, which appears at the bottom of the pyramid (indicating highest criticality/importance).
Lastly, a more structured image…
Image(filename="/content/org_chart.jpg")
The query…
message_list = [
{
"role": "user",
"content": [
{
"type": "image",
"source": {
"type": "base64",
"media_type": "image/jpeg",
"data": get_base64_encoded_image("/content/org_chart.jpg"),
},
},
{
"type": "text",
"text": "Turn this org chart into JSON indicating who reports to who. Only output the JSON and nothing else. But also show the hierarchy and relationships between people.",
},
],
}
]
response = client.messages.create(model=MODEL_NAME, max_tokens=2048, messages=message_list)
print(response.content[0].text)
And the response in JSON…
```json
{
"name": "John Smith",
"title": "President",
"reports": [
{
"name": "Susan Jones",
"title": "VP Marketing",
"reports": [
{
"name": "Alice Johnson",
"title": "Manager",
"reports": []
},
{
"name": "Tim Moore",
"title": "Manager",
"reports": []
}
]
},
{
"name": "Rachel Parker",
"title": "VP Sales",
"reports": [
{
"name": "Michael Gross",
"title": "Manager",
"reports": []
},
{
"name": "Kim Dole",
"title": "Manager",
"reports": []
}
]
},
{
"name": "Tom Allen",
"title": "VP Production",
"reports": [
{
"name": "Kathy Roberts",
"title": "Manager",
"reports": []
},
{
"name": "Betsy Foster",
"title": "Manager",
"reports": []
}
]
}
]
}
```
Chief Evangelist @ Kore.ai | I’m passionate about exploring the intersection of AI and language. Language Models, AI Agents, Agentic Apps, Dev Frameworks & Data-Driven Tools shaping tomorrow.
