Use SGLang for ultra-low latency, high-throughput production serving with many concurrent requests.
SGLang requires a CUDA-compatible GPU. For CPU-only environments, consider using llama.cpp instead.
## Supported Models

| Model Type | Status | Examples |
| --- | --- | --- |
| Dense text models | Supported | LFM2-350M, LFM2.5-1.2B-Instruct, LFM2-2.6B |
| MoE text models | Supported | LFM2-8B-A1B, LFM2-24B-A2B |
| Vision models | Supported | LFM2-VL-450M, LFM2-VL-3B, LFM2.5-VL-1.6B |
All LFM model types are supported as of SGLang v0.5.10.
## Installation

Install SGLang following the official installation guide. The recommended method (requires sglang>=0.5.10) is:

```bash
pip install --upgrade pip
pip install uv
uv pip install "sglang>=0.5.10"
```
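To confirm the installed version meets the requirement, a quick check (assuming the standard `__version__` attribute):

```bash
python -c "import sglang; print(sglang.__version__)"  # should print 0.5.10 or later
```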
## Launching the Server

By default the model runs in bfloat16. To use float16 instead, add `--dtype float16` and set `export SGLANG_MAMBA_CONV_DTYPE=float16` before launching.

```bash
sglang serve \
  --model-path LiquidAI/LFM2.5-1.2B-Instruct \
  --host 0.0.0.0 \
  --port 30000 \
  --tool-call-parser lfm2
```
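For example, a float16 launch combines both settings from the note above:

```bash
# Set the Mamba conv dtype to match, then launch with --dtype float16
export SGLANG_MAMBA_CONV_DTYPE=float16
sglang serve \
  --model-path LiquidAI/LFM2.5-1.2B-Instruct \
  --host 0.0.0.0 \
  --port 30000 \
  --tool-call-parser lfm2 \
  --dtype float16
```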
### Docker

Alternatively, launch the server with the official Docker image. All LFM model types (dense, MoE, vision) are supported in the `v0.5.10` image tag and later.

For CUDA 13 environments (B300/GB300), use `lmsysorg/sglang:v0.5.10-cu13`.

The `HF_TOKEN` environment variable is optional, but it can speed up downloads and reduce retry errors. We recommend a read-only Hugging Face token for reliability.
```bash
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:v0.5.10 \
  sglang serve \
    --model-path LiquidAI/LFM2.5-1.2B-Instruct \
    --host 0.0.0.0 \
    --port 30000 \
    --tool-call-parser lfm2
```
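Once the container is up, a quick way to verify the server is ready is to list the hosted models (a standard endpoint of the OpenAI-compatible API described below):

```bash
# Expects a JSON model list that includes LiquidAI/LFM2.5-1.2B-Instruct
curl http://localhost:30000/v1/models
```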
### Querying the Server

Once the server is running, query it with the OpenAI Python client. SGLang exposes an OpenAI-compatible API; the example below also demonstrates tool calling:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="None",
)

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get current weather for a location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City name"}
            },
            "required": ["location"],
        },
    },
}]

response = client.chat.completions.create(
    model="LiquidAI/LFM2.5-1.2B-Instruct",
    messages=[
        {"role": "user", "content": "What's the weather in San Francisco?"}
    ],
    tools=tools,
    tool_choice="auto",
    temperature=0,
)
print(response.choices[0].message)
```
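A minimal sketch of completing the tool-call loop, assuming the model requested `get_weather` (the local weather lookup and its return value here are hypothetical):

```python
import json

message = response.choices[0].message
if message.tool_calls:
    call = message.tool_calls[0]
    args = json.loads(call.function.arguments)

    # Hypothetical local implementation of the tool
    result = {"location": args["location"], "temperature_c": 18, "condition": "sunny"}

    # Return the tool result so the model can produce a final answer
    followup = client.chat.completions.create(
        model="LiquidAI/LFM2.5-1.2B-Instruct",
        messages=[
            {"role": "user", "content": "What's the weather in San Francisco?"},
            message,  # the assistant turn containing the tool call
            {"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)},
        ],
        tools=tools,
    )
    print(followup.choices[0].message.content)
```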
For more details on tool use with LFM models, see Tool Use.
## Vision Models

Launch a vision-language model:

```bash
sglang serve \
  --model-path LiquidAI/LFM2.5-VL-1.6B \
  --host 0.0.0.0 \
  --port 30000
```
Query with an image:

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="None")

response = client.chat.completions.create(
    model="LiquidAI/LFM2.5-VL-1.6B",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "http://images.cocodataset.org/val2017/000000039769.jpg"}},
            {"type": "text", "text": "Describe what you see in this image."},
        ],
    }],
    temperature=0.0,
    max_tokens=256,
)
print(response.choices[0].message.content)
```
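Local images can be sent as base64 data URLs instead of a public URL; a minimal sketch reusing the client from above and following the standard OpenAI `image_url` convention (`cats.jpg` is a placeholder path):

```python
import base64

# Encode a local file as a JPEG data URL
with open("cats.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="LiquidAI/LFM2.5-VL-1.6B",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            {"type": "text", "text": "Describe what you see in this image."},
        ],
    }],
)
print(response.choices[0].message.content)
```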
You can also query the server directly with cURL (shown here for the text model):

```bash
curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "LiquidAI/LFM2.5-1.2B-Instruct",
    "messages": [
      {"role": "user", "content": "What is AI?"}
    ],
    "temperature": 0
  }'
```
## Offline Inference

SGLang's `Engine` class provides a simple interface for offline inference without launching a server. This is useful for scripts, notebooks, and batch processing.

In Jupyter notebooks, you must apply `nest_asyncio` before creating the engine, because SGLang uses an async event loop internally.
### Text Generation

```python
# Required in Jupyter notebooks:
# import nest_asyncio
# nest_asyncio.apply()
import sglang as sgl

llm = sgl.Engine(model_path="LiquidAI/LFM2-8B-A1B")
tokenizer = llm.tokenizer_manager.tokenizer

messages = [{"role": "user", "content": "What is the capital of France?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

output = llm.generate(prompt=prompt, sampling_params={"max_new_tokens": 128, "temperature": 0})
print(output["text"])

llm.shutdown()
```
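`Engine.generate` also accepts a list of prompts, which suits batch processing; a minimal sketch, assuming the batched call returns one output dict per prompt in order:

```python
import sglang as sgl

llm = sgl.Engine(model_path="LiquidAI/LFM2-8B-A1B")
tokenizer = llm.tokenizer_manager.tokenizer

questions = ["What is the capital of France?", "What is the capital of Japan?"]
prompts = [
    tokenizer.apply_chat_template(
        [{"role": "user", "content": q}], tokenize=False, add_generation_prompt=True
    )
    for q in questions
]

# Assumption: a list of prompts yields a list of outputs in the same order
outputs = llm.generate(prompt=prompts, sampling_params={"max_new_tokens": 64, "temperature": 0})
for q, out in zip(questions, outputs):
    print(q, "->", out["text"])

llm.shutdown()
```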
### Vision Models

```python
# Required in Jupyter notebooks:
# import nest_asyncio
# nest_asyncio.apply()
import sglang as sgl

vlm = sgl.Engine(model_path="LiquidAI/LFM2.5-VL-1.6B")
processor = vlm.tokenizer_manager.processor

messages = [{"role": "user", "content": [
    {"type": "image", "image": "placeholder"},
    {"type": "text", "text": "Describe what you see in this image."},
]}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

output = vlm.generate(
    prompt=prompt,
    image_data="http://images.cocodataset.org/val2017/000000039769.jpg",
    sampling_params={"max_new_tokens": 256, "temperature": 0},
)
print(output["text"])

vlm.shutdown()
```
## Low Latency on Blackwell (B300)

Running a 1.2B model on a B300 may sound counterintuitive, but combining `--enable-torch-compile` with Blackwell's architecture unlocks extremely low latency, ideal for latency-sensitive workloads like RAG, search, and real-time chat.

We recommend `--enable-torch-compile` for workloads with concurrency under 256. For pure throughput batch processing at very high concurrency, skip this flag.

Key flags for low latency:

- `--enable-torch-compile`: Compiles the model with torch.compile for faster execution. Adds startup time but significantly reduces per-token latency.
- `--chunked-prefill-size -1`: Disables chunked prefill, processing the full prompt in one pass. This lowers TTFT at the cost of slightly reduced throughput under high concurrency.
```bash
sglang serve \
  --model-path LiquidAI/LFM2.5-1.2B-Instruct \
  --host 0.0.0.0 \
  --port 30000 \
  --tool-call-parser lfm2 \
  --enable-torch-compile \
  --chunked-prefill-size -1
```
On B300/CUDA 13, use the dedicated Docker image:
```bash
docker run --gpus all \
  --shm-size 32g \
  -p 30000:30000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  --env "HF_TOKEN=<secret>" \
  --ipc=host \
  lmsysorg/sglang:v0.5.10-cu13 \
  sglang serve \
    --model-path LiquidAI/LFM2.5-1.2B-Instruct \
    --host 0.0.0.0 \
    --port 30000 \
    --tool-call-parser lfm2 \
    --enable-torch-compile \
    --chunked-prefill-size -1
```
Example benchmark on a B300 GPU with CUDA 13 (256 prompts, 1024 input tokens, 128 output tokens, max concurrency 1):

| Metric | Value |
| --- | --- |
| Mean TTFT (ms) | 8.79 |
| Mean TPOT (ms) | 0.86 |
| Output token throughput (tok/s) | 1100.92 |
The benchmark command:

```bash
python3 -m sglang.bench_serving \
  --backend sglang-oai-chat \
  --num-prompts 256 \
  --max-concurrency 1 \
  --random-input-len 1024 \
  --random-output-len 128 \
  --warmup-requests 128
```