Use SGLang for ultra-low latency, high-throughput production serving with many concurrent requests.
SGLang requires a CUDA-compatible GPU. For CPU-only environments, consider using llama.cpp instead.
## Supported Models

| Model Type | Status | Examples |
| --- | --- | --- |
| Dense text models | Supported | LFM2-350M, LFM2.5-1.2B-Instruct, LFM2-2.6B |
| MoE text models | Supported | LFM2-8B-A1B, LFM2-24B-A2B |
| Vision models | Supported | LFM2-VL-450M, LFM2-VL-3B, LFM2.5-VL-1.6B |
All LFM model types are supported as of SGLang v0.5.10.
## Installation

Install SGLang following the official installation guide. The recommended method (requires sglang>=0.5.10) is:

```bash
pip install --upgrade pip
pip install uv
uv pip install "sglang>=0.5.10"
```
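To confirm the installed version meets the requirement, a quick check (assuming the standard `__version__` attribute):

```bash
python -c "import sglang; print(sglang.__version__)"  # should print 0.5.10 or later
```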
## Launching the Server

By default the model runs in bfloat16. To use float16 instead, add `--dtype float16` and set `export SGLANG_MAMBA_CONV_DTYPE=float16` before launching.

```bash
sglang serve \
  --model-path LiquidAI/LFM2.5-1.2B-Instruct \
  --host 0.0.0.0 \
  --port 30000 \
  --tool-call-parser lfm2
```
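For example, a float16 launch combines both settings from the note above:

```bash
# Set the Mamba conv dtype to match, then launch with --dtype float16
export SGLANG_MAMBA_CONV_DTYPE=float16
sglang serve \
  --model-path LiquidAI/LFM2.5-1.2B-Instruct \
  --host 0.0.0.0 \
  --port 30000 \
  --tool-call-parser lfm2 \
  --dtype float16
```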
### Docker

Alternatively, launch the server with the official Docker image. All LFM model types (dense, MoE, vision) are supported in the `v0.5.10` image tag and later.

For CUDA 13 environments (B300/GB300), use `lmsysorg/sglang:v0.5.10-cu13`.

The `HF_TOKEN` environment variable is optional, but it can speed up downloads and reduce retry errors. We recommend a read-only Hugging Face token for reliability.
```bash
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:v0.5.10 \
  sglang serve \
    --model-path LiquidAI/LFM2.5-1.2B-Instruct \
    --host 0.0.0.0 \
    --port 30000 \
    --tool-call-parser lfm2
```
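Once the container is up, a quick way to verify the server is ready is to list the hosted models (a standard endpoint of the OpenAI-compatible API described below):

```bash
# Expects a JSON model list that includes LiquidAI/LFM2.5-1.2B-Instruct
curl http://localhost:30000/v1/models
```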
### Querying the Server

Once the server is running, query it with the OpenAI Python client. SGLang exposes an OpenAI-compatible API; the example below also demonstrates tool calling:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="None",
)

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City name"}
            },
            "required": ["location"],
        },
    },
}]

response = client.chat.completions.create(
    model="LiquidAI/LFM2.5-1.2B-Instruct",
    messages=[
        {"role": "user", "content": "What's the weather in San Francisco?"}
    ],
    tools=tools,
    tool_choice="auto",
    temperature=0,
)
print(response.choices[0].message)
```
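A minimal sketch of completing the tool-call loop, assuming the model requested `get_weather` (the local weather lookup and its return value here are hypothetical):

```python
import json

message = response.choices[0].message
if message.tool_calls:
    call = message.tool_calls[0]
    args = json.loads(call.function.arguments)

    # Hypothetical local implementation of the tool
    result = {"location": args["location"], "temperature_c": 18, "condition": "sunny"}

    # Return the tool result so the model can produce a final answer
    followup = client.chat.completions.create(
        model="LiquidAI/LFM2.5-1.2B-Instruct",
        messages=[
            {"role": "user", "content": "What's the weather in San Francisco?"},
            message,  # the assistant turn containing the tool call
            {"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)},
        ],
        tools=tools,
    )
    print(followup.choices[0].message.content)
```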
For more details on tool use with LFM models, see Tool Use.
## Vision Models

Launch a vision-language model:

```bash
sglang serve \
  --model-path LiquidAI/LFM2.5-VL-1.6B \
  --host 0.0.0.0 \
  --port 30000
```
Query with an image:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="None")

response = client.chat.completions.create(
    model="LiquidAI/LFM2.5-VL-1.6B",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "http://images.cocodataset.org/val2017/000000039769.jpg"}},
            {"type": "text", "text": "Describe what you see in this image."},
        ],
    }],
    temperature=0.0,
    max_tokens=256,
)
print(response.choices[0].message.content)
```
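Local images can be sent as base64 data URLs instead of a public URL; a minimal sketch reusing the client from above and following the standard OpenAI `image_url` convention (`cats.jpg` is a placeholder path):

```python
import base64

# Encode a local file as a JPEG data URL
with open("cats.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="LiquidAI/LFM2.5-VL-1.6B",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            {"type": "text", "text": "Describe what you see in this image."},
        ],
    }],
)
print(response.choices[0].message.content)
```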
You can also query the server directly with cURL (shown here for the text model):

```bash
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "LiquidAI/LFM2.5-1.2B-Instruct",
    "messages": [
      {"role": "user", "content": "What is AI?"}
    ],
    "temperature": 0
  }'
```
## Offline Inference

SGLang's `Engine` class provides a simple interface for offline inference without launching a server. This is useful for scripts, notebooks, and batch processing.

In Jupyter notebooks, you must apply `nest_asyncio` before creating the engine, because SGLang uses an async event loop internally.
### Text Generation

```python
# Required in Jupyter notebooks:
# import nest_asyncio
# nest_asyncio.apply()
import sglang as sgl

llm = sgl.Engine(model_path="LiquidAI/LFM2-8B-A1B")
tokenizer = llm.tokenizer_manager.tokenizer

messages = [{"role": "user", "content": "What is the capital of France?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

output = llm.generate(prompt=prompt, sampling_params={"max_new_tokens": 128, "temperature": 0})
print(output["text"])

llm.shutdown()
```
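`Engine.generate` also accepts a list of prompts, which suits batch processing; a minimal sketch, assuming the batched call returns one output dict per prompt in order:

```python
import sglang as sgl

llm = sgl.Engine(model_path="LiquidAI/LFM2-8B-A1B")
tokenizer = llm.tokenizer_manager.tokenizer

questions = ["What is the capital of France?", "What is the capital of Japan?"]
prompts = [
    tokenizer.apply_chat_template(
        [{"role": "user", "content": q}], tokenize=False, add_generation_prompt=True
    )
    for q in questions
]

# Assumption: a list of prompts yields a list of outputs in the same order
outputs = llm.generate(prompt=prompts, sampling_params={"max_new_tokens": 64, "temperature": 0})
for q, out in zip(questions, outputs):
    print(q, "->", out["text"])

llm.shutdown()
```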
### Vision Models

```python
# Required in Jupyter notebooks:
# import nest_asyncio
# nest_asyncio.apply()
import sglang as sgl

vlm = sgl.Engine(model_path="LiquidAI/LFM2.5-VL-1.6B")
processor = vlm.tokenizer_manager.processor

messages = [{"role": "user", "content": [
    {"type": "image", "image": "placeholder"},
    {"type": "text", "text": "Describe what you see in this image."},
]}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

output = vlm.generate(
    prompt=prompt,
    image_data="http://images.cocodataset.org/val2017/000000039769.jpg",
    sampling_params={"max_new_tokens": 256, "temperature": 0},
)
print(output["text"])

vlm.shutdown()
```
## Low Latency on Blackwell (B300)

Running a 1.2B model on a B300 may sound counterintuitive, but combining `--enable-torch-compile` with Blackwell's architecture unlocks extremely low latency, ideal for latency-sensitive workloads like RAG, search, and real-time chat.

We recommend `--enable-torch-compile` for workloads with concurrency under 256. For pure throughput batch processing at very high concurrency, skip this flag.

Key flags for low latency:

- `--enable-torch-compile`: Compiles the model with torch.compile for faster execution. Adds startup time but significantly reduces per-token latency.
- `--chunked-prefill-size -1`: Disables chunked prefill, processing the full prompt in one pass. This lowers TTFT at the cost of slightly reduced throughput under high concurrency.
```bash
sglang serve \
  --model-path LiquidAI/LFM2.5-1.2B-Instruct \
  --host 0.0.0.0 \
  --port 30000 \
  --tool-call-parser lfm2 \
  --enable-torch-compile \
  --chunked-prefill-size -1
```
On B300/CUDA 13, use the dedicated Docker image:
```bash
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:v0.5.10-cu13 \
  sglang serve \
    --model-path LiquidAI/LFM2.5-1.2B-Instruct \
    --host 0.0.0.0 \
    --port 30000 \
    --tool-call-parser lfm2 \
    --enable-torch-compile \
    --chunked-prefill-size -1
```
Example benchmark on a B300 GPU with CUDA 13 (256 prompts, 1024 input tokens, 128 output tokens, max concurrency 1):

| Metric | Value |
| --- | --- |
| Mean TTFT (ms) | 8.79 |
| Mean TPOT (ms) | 0.86 |
| Output token throughput (tok/s) | 1100.92 |
The benchmark command:

```bash
python3 -m sglang.bench_serving \
  --backend sglang-oai-chat \
  --num-prompts 256 \
  --max-concurrency 1 \
  --random-input-len 1024 \
  --random-output-len 128 \
  --warmup-requests 128
```