AI Explained: LLM Performance - Slow Python Transformers, Fast Golang & Rust, but not always
My next blog post on LLMs was going to be about how performance can be optimised by going from a Python model to a Rust-based model. For this I wrote a simple Python script.
import time
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
# Hugging Face model identifier
MODEL_NAME = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"
# User and system prompts
user_prompt = "How many helicopters can a human eat in one sitting?"
system_prompt = (
    "<|system|>\nYou are a friendly chatbot who always responds in the style of a pirate</s>\n"
    "<|user|>\n{prompt}</s>\n<|assistant|>\n"
).format(prompt=user_prompt)
# Check if CUDA is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
# Start timing for model loading
start_time = time.time()
# Load the tokenizer and model from Hugging Face
print("Loading tokenizer and model...")
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).to(device)
# Ensure the tokenizer has a padding token
if tokenizer.pad_token is None:
    tokenizer.add_special_tokens({'pad_token': tokenizer.eos_token})
    model.resize_token_embeddings(len(tokenizer))
# End timing for model loading
load_time = time.time() - start_time
print(f"Model loaded in {load_time:.2f} seconds.")
# Tokenize the input
print("Tokenizing input...")
inputs = tokenizer(
    system_prompt,
    return_tensors="pt",
    padding=True,
    truncation=True,
).to(device)
# Generate response with timing
print("Generating response...")
start_time = time.time()
outputs = model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,
    max_length=200,  # Limit total sequence length (prompt + response)
    temperature=0.7,  # Adjusts randomness in output
    top_p=0.9,  # Nucleus sampling
    do_sample=True,
)
generation_time = time.time() - start_time
num_tokens = outputs.shape[1]  # Length of the returned sequence (prompt tokens + generated tokens)
tokens_per_second = num_tokens / generation_time
# Decode and print the generated text
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print("Response:\n", response)
# Output timing metrics
print(f"Generation completed in {generation_time:.2f} seconds.")
print(f"Tokens generated per second: {tokens_per_second:.2f}")
So the above loads a small Llama 2-architecture model (TinyLlama 1.1B Chat, to be exact) and sends it a system prompt and a user prompt. It measures the model loading time, the total generation time and the tokens per second.
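One detail to keep in mind when reading the script's throughput figure: for a decoder-only model, `generate` returns the prompt tokens followed by the new tokens, so `outputs.shape[1]` is the length of the whole sequence. Below is a minimal sketch of how the two ways of counting differ. The helper `throughput` is hypothetical and the numbers are made up for illustration; they are not measurements from this post.

```python
def throughput(total_tokens: int, prompt_tokens: int, seconds: float) -> dict:
    """Return tokens per second counted two ways.

    total_tokens  -- length of the sequence returned by generate (prompt + new)
    prompt_tokens -- length of the tokenized prompt
    seconds       -- wall-clock generation time
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    new_tokens = total_tokens - prompt_tokens
    return {
        "including_prompt": total_tokens / seconds,
        "new_tokens_only": new_tokens / seconds,
    }


# Illustrative numbers only (hypothetical, not from the benchmark)
example = throughput(total_tokens=120, prompt_tokens=40, seconds=2.0)
print(example)
```

In the script, the prompt length would be `inputs.input_ids.shape[1]`, so the new-token count is `outputs.shape[1] - inputs.input_ids.shape[1]`.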
At the same time I downloaded burn-llama from https://github.com/tracel-ai/models.git
Build it: cargo build --release --features tiny,cuda --example chat
Asked it the same question with the same parameters: ./target/release/examples/chat --top-p 0.9 --temperature=0.7 --max-seq-len=200
My expectation was that Rust would be many times faster. To my surprise Rust did this:
9 seconds to load the model, 10 seconds to generate and 6.4 tokens per second.
On the other hand, Python Transformers did this:
4.6 seconds loading time, 2.46 seconds generation time and 50.47 tokens per second. Almost 8 times the throughput of the Rust version!!!
So I tried Candle next, but did not even get it to work correctly.
I also tried Ollama (written in Golang), which did a lot better at 137.38 tokens per second. That is almost 3 times the throughput of Python Transformers and in line with my expectations.
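The post does not say how the Ollama figure was obtained. One way to get a comparable number is Ollama's REST API, whose `/api/generate` response reports `eval_count` (generated tokens) and `eval_duration` (nanoseconds). The sketch below assumes a server on the default local port and a model tag of `tinyllama`; both are assumptions, and the sample response values are made up.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # assumption: default local endpoint
MODEL = "tinyllama"  # assumption: the model tag that was pulled locally


def tokens_per_second(response: dict) -> float:
    """Compute generation throughput from an Ollama /api/generate response.

    eval_count is the number of generated tokens; eval_duration is the time
    spent generating them, in nanoseconds.
    """
    return response["eval_count"] / (response["eval_duration"] / 1e9)


def ask_ollama(prompt: str) -> dict:
    """Send one non-streaming request to a locally running Ollama server."""
    payload = json.dumps({
        "model": MODEL,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": 0.7, "top_p": 0.9},
    }).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as reply:
        return json.load(reply)


# Made-up response fields, to show the arithmetic without a running server
sample = {"eval_count": 100, "eval_duration": 2_000_000_000}
print(f"{tokens_per_second(sample):.2f} tokens/s")
```

With a server running, `tokens_per_second(ask_ollama("How many helicopters can a human eat in one sitting?"))` would give the figure for a real request.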
Luckily I had used WasmEdge before. By building its WASI-NN plugin, downloading llama-chat.wasm from https://github.com/LlamaEdge/LlamaEdge and compiling the wasm for better performance, I was able to run TinyLlama-1.1B-Chat-v1.0-Q5_K_M.gguf and get the following performance:
A load time of 3.35 seconds and 121.74 tokens per second.
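To put the four throughput figures side by side, here is a small sketch that ranks them relative to Python Transformers. The numbers are the ones reported in this post; the helper name `relative_speed` is just for illustration.

```python
# Tokens per second as reported in this post
results = {
    "Burn (Rust)": 6.4,
    "Python Transformers": 50.47,
    "Ollama": 137.38,
    "WasmEdge + LlamaEdge": 121.74,
}

baseline = results["Python Transformers"]


def relative_speed(tokens_per_second: float, baseline: float) -> float:
    """Throughput expressed as a multiple of the baseline throughput."""
    return tokens_per_second / baseline


# Print fastest first, with the multiple of the Python Transformers figure
for name, tps in sorted(results.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<22} {tps:>7.2f} tok/s  {relative_speed(tps, baseline):.2f}x")
```

Keep in mind that the runs are not strictly like for like: the WasmEdge run used a Q5_K_M quantized GGUF file, for example.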
So although Burn is a great abstraction for creating LLM models, performance-wise it still needs some time. I was able to do other things with Candle, and given that Hugging Face is behind it and their support is great, I should hopefully soon be able to update the post with better numbers. Candle, however, is not for the faint of heart because it is very low-level. Ollama is easy to use. WasmEdge is very promising for serverless Llama. It was also the solution that made the best use of my three GPUs.
All of the GPUs were used, whereas the other solutions only used GPU 1. Ollama left 12% of the GPU occupied, whereas WasmEdge left only 3% occupied on each GPU for a similar model.
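For anyone who wants to check GPU usage on their own machine, per-GPU figures can be read from `nvidia-smi` in CSV form. The sketch below is one way to parse them; the helper names are hypothetical and the sample line values are made up, not the readings from my machine.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text: str) -> list:
    """Parse nvidia-smi CSV output (no header, no units) into one dict per GPU."""
    stats = []
    for line in csv_text.strip().splitlines():
        index, util, used, total = (field.strip() for field in line.split(","))
        stats.append({
            "gpu": int(index),
            "utilization_pct": int(util),
            "memory_used_pct": round(100 * int(used) / int(total), 1),
        })
    return stats


def read_gpu_stats() -> list:
    """Run nvidia-smi once and return the parsed per-GPU statistics."""
    output = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_stats(output.stdout)


# Made-up sample output for three GPUs (hypothetical values)
sample = "0, 35, 2048, 8192\n1, 40, 1024, 8192\n2, 38, 1024, 8192\n"
for gpu in parse_gpu_stats(sample):
    print(gpu)
```

Calling `read_gpu_stats()` while a model is generating shows whether the load is spread across all GPUs or concentrated on one.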
The beauty of WasmEdge + LlamaEdge is also the ease of deployment. Deploying Python requires multi-GB Docker containers, whereas llama-chat_opt.wasm is only 14.5 MB and contains all the code in a single file. So deploying updated model code can be done really efficiently.
See other posts in the AI explained series:
- Retrieval Augmented Generation
- LLMs for content classification
- Integrating external tools with LangChain
- Image to text and computer vision
- MLOps and launching AI in production
If you are more interested in the business side of AI:
If your business needs help with AI, why don’t we connect?