Develop Generative Apps Locally

I wanted to create a complete Generative App ecosystem running on a MacBook & this is how I did it.

5 min read · Feb 29, 2024


Introduction

For me at least, there is a certain allure to running a complete generative development configuration locally on CPU.

Achieving this was easier and much less technical than I initially expected.

Small Language Models (SLMs) can be accessed via HuggingFace and served locally with an inference server such as TitanML’s TakeOff Server.

The inference server pulls the model from the HuggingFace Hub. Defining which model to use is very easy: the model is named in the command that starts the server, and the inference server then finds the model and downloads it locally.

The inference server serves the language model, exposes APIs, offers a playground and performs quantisation.

Small Language Model

In a previous article on quantisation, I made use of TinyLlama as the SLM. TinyLlama is a compact 1.1B Small Language Model (SLM) pre-trained on around 1 trillion tokens for approximately 3 epochs.

Despite its relatively small size, TinyLlama demonstrates remarkable performance in a series of downstream tasks. It significantly outperforms existing open-source language models with comparable sizes.

In this prototype, I used Meta AI’s opt-125m Small Language Model. The pretrained-only model can be used for prompting, for evaluation of downstream tasks, and for text generation. As mentioned in Meta AI’s model card, the training data used for this model contains a lot of unfiltered content from the internet, which is far from neutral, so the model is strongly biased.
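To put the two model sizes mentioned above in perspective, here is a rough back-of-the-envelope calculation of how much memory the weights alone need at different numeric precisions. The parameter counts come from the model names (1.1B and 125M); the figures ignore activations and runtime overhead, so treat them as lower bounds rather than measurements.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}

# Parameter counts taken from the model names used in this article.
MODELS = {"TinyLlama-1.1B": 1.1e9, "opt-125m": 125e6}


def weight_size_gb(num_params, dtype):
    """Approximate size of the weights in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for name, params in MODELS.items():
    sizes = ", ".join(
        f"{dtype}: {weight_size_gb(params, dtype):.2f} GB"
        for dtype in BYTES_PER_PARAM
    )
    print(f"{name} -> {sizes}")
```

Even at full 32-bit precision both models fit comfortably in laptop memory, and quantising to 8 bits cuts the footprint to roughly a quarter.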

Local Inference Server

The TitanML Inference Server is managed via Docker and, once Docker is installed, can be set up with only two commands.

docker pull tytn/takeoff-pro:0.11.0-cpu

and

docker run -it \
-e TAKEOFF_MODEL_NAME=TinyLlama/TinyLlama-1.1B-Chat-v1.0 \
-e TAKEOFF_DEVICE=cpu \
-e LICENSE_KEY=<INSERT_LICENSE_KEY_HERE> \
-e TAKEOFF_MAX_SEQUENCE_LENGTH=128 \
-p 3000:3000 \
-p 3001:3001 \
-v ~/.model_cache:/code/models \
tytn/takeoff-pro:0.11.0-cpu

Below is a screenshot of the inference server starting; notice the model details being shown.

The playground can then be opened at http://localhost:3000/#/playground, all offline and running locally.
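The playground is a front end to the same server, which can also be called programmatically. The sketch below uses only the Python standard library and assumes the server accepts a JSON POST on a `/generate` route at port 3000 with the prompt in a `text` field. The route name, the field name and the shape of the reply are assumptions here, so check the Takeoff documentation for the version you are running.

```python
import json
import urllib.request

# Assumption: the Takeoff server accepts POST requests on /generate (port 3000).
TAKEOFF_URL = "http://localhost:3000/generate"


def build_generate_request(prompt, url=TAKEOFF_URL, **params):
    """Build (but do not send) a JSON POST request carrying the prompt.

    Extra keyword arguments are passed through as additional JSON fields;
    which fields the server understands depends on the Takeoff version.
    """
    payload = {"text": prompt, **params}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt, timeout=60.0, **params):
    """Send the prompt to the local server and return the decoded JSON reply."""
    request = build_generate_request(prompt, **params)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# With the container running, a call would look like:
#   print(generate("What is Deep Learning?"))
```

Keeping the request construction separate from the network call makes it easy to inspect exactly what would be sent before the server is even started.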

Below is a link to the model I used on HuggingFace. I was surprised by how easily models can be referenced, downloaded and run, all managed seamlessly by the inference server.

Notice how the model name is defined within the command that starts and runs the inference server instance.

docker run -it \
-e TAKEOFF_MODEL_NAME=facebook/opt-125m \
-e TAKEOFF_DEVICE=cpu \
-e LICENSE_KEY=<INSERT_LICENSE_KEY_HERE> \
-e TAKEOFF_MAX_SEQUENCE_LENGTH=128 \
-p 3000:3000 \
-p 3001:3001 \
-v ~/.model_cache:/code/models \
tytn/takeoff-pro:0.11.0-cpu
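Since the two commands above differ only in the model name, the launch can be parameterised. The following is a minimal sketch that assembles the same `docker run` arguments in Python; the function name `takeoff_command` is hypothetical, while every flag, port and image tag is copied from the commands above.

```python
import os
import shlex

IMAGE = "tytn/takeoff-pro:0.11.0-cpu"


def takeoff_command(model_name, license_key, max_sequence_length=128):
    """Return the `docker run` argument list for a given HuggingFace model."""
    env = {
        "TAKEOFF_MODEL_NAME": model_name,
        "TAKEOFF_DEVICE": "cpu",
        "LICENSE_KEY": license_key,
        "TAKEOFF_MAX_SEQUENCE_LENGTH": str(max_sequence_length),
    }
    command = ["docker", "run", "-it"]
    for name, value in env.items():
        command += ["-e", f"{name}={value}"]
    command += ["-p", "3000:3000", "-p", "3001:3001"]
    # "~" is only expanded by a shell, so resolve it explicitly here.
    cache_dir = os.path.expanduser("~/.model_cache")
    command += ["-v", f"{cache_dir}:/code/models", IMAGE]
    return command


# Print the command for each of the two models used in this article.
for model in ("TinyLlama/TinyLlama-1.1B-Chat-v1.0", "facebook/opt-125m"):
    print(shlex.join(takeoff_command(model, "<INSERT_LICENSE_KEY_HERE>")))
```

The returned list can be handed to `subprocess.run(...)` to start the container from a script, which also keeps the license key out of the command text if it is read from an environment variable.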

Notebook

As seen in the image below, the Jupyter notebook instance was also running locally, with LangChain and the TitanML libraries installed.

LangChain Applications

Here is the simplest version of a LangChain application, posing a question to the LLM.

pip install langchain-community

from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.prompts import PromptTemplate
from langchain_community.llms import TitanTakeoffPro

# Example 1: Basic use
llm = TitanTakeoffPro()
output = llm.invoke("What is the weather in London in August?")  # invoke() replaces the deprecated direct call llm(...)
print(output)

Two different questions can also be sent in a single call.

llm = TitanTakeoffPro()
rich_output = llm.generate(["What is Deep Learning?", "What is Machine Learning?"])
print(rich_output.generations)
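The `generations` attribute printed above is a nested list: one inner list per prompt, each holding one or more candidate completions with a `.text` attribute. A small helper can flatten that into plain strings. The sketch below uses a stand-in class (`FakeGeneration`, hypothetical) so that it runs without a server; with the real result you would pass `rich_output.generations` instead.

```python
from dataclasses import dataclass


@dataclass
class FakeGeneration:
    """Stand-in for a LangChain Generation object; only `.text` is used."""
    text: str


def first_texts(generations):
    """Return the first candidate's text for each prompt, or "" if none."""
    return [candidates[0].text if candidates else "" for candidates in generations]


# Same nesting as rich_output.generations: one inner list per prompt.
example = [
    [FakeGeneration("Deep Learning is ...")],
    [FakeGeneration("Machine Learning is ...")],
]
for answer in first_texts(example):
    print(answer)
```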

And lastly, using LangChain’s LCEL:

llm = TitanTakeoffPro()
prompt = PromptTemplate.from_template("Tell me about {topic}")
chain = prompt | llm
chain.invoke({"topic": "the universe"})
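For readers new to LCEL, the `|` operator simply feeds the output of one step into the next: the prompt template is filled in first, and the resulting string is then passed to the model. The toy sketch below mimics that idea in plain Python. It is an illustration only, not how LangChain implements it, and the `Step` class and `echo_llm` stand-in are hypothetical.

```python
class Step:
    """Toy pipeline step: wraps a function and supports chaining with `|`."""

    def __init__(self, func):
        self.func = func

    def invoke(self, value):
        return self.func(value)

    def __or__(self, other):
        # Run this step first, then pass its output to the next step.
        return Step(lambda value: other.invoke(self.invoke(value)))


# Fills the template from a dict of variables, like a prompt template would.
prompt = Step(lambda variables: "Tell me about {topic}".format(**variables))

# Stand-in for the model: echoes the prompt it received.
echo_llm = Step(lambda text: f"[model reply to: {text}]")

chain = prompt | echo_llm
print(chain.invoke({"topic": "the universe"}))
# prints: [model reply to: Tell me about the universe]
```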

Conclusion

Advantages of local inference include:

Reduced Latency

Local inference eliminates the need to communicate with remote servers, which removes the network round trip and lowers latency. This is beneficial for applications requiring real-time responses or low-latency interactions, and especially so when SLMs are used, since their small size keeps generation on local hardware fast.

Improved Privacy and Security

By keeping data on the local device, local inference minimises the risk of exposing sensitive information to external parties. This enhances privacy and security, as user data is not transmitted over networks where it could potentially be intercepted or compromised.

Offline Functionality

Local inference enables applications to function even without an internet connection, allowing users to access language models and perform tasks offline. This is advantageous in scenarios where internet access is limited or unreliable, ensuring uninterrupted functionality and user experience.

I’m currently the Chief Evangelist @ Kore AI. I explore & write about all things at the intersection of AI & language; ranging from LLMs, Chatbots, Voicebots, Development Frameworks, Data-Centric latent spaces & more.

I explore and write about all things at the intersection of AI & language; LLMs/NLP/NLU, Chat/Voicebots, CCAI. www.cobusgreyling.com