Clone the repository
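The repository URL in the sketch below is a placeholder, not the actual URL; substitute the repository you are deploying from:

```bash
# Clone the deployment repository and enter it
# (<repository-url> and <repository-name> are placeholders)
git clone <repository-url>
cd <repository-name>
```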
Option 1. Use vLLM docker image
You can use the vLLM docker image `vllm/vllm-openai` to deploy LFM. This is the recommended approach for production deployment.

Launch command:
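A minimal sketch of a launch command for this image, assuming a plain single-GPU host and the `LFM2-8B-A1B` checkpoint; the repository's own deployment (e.g. the `deploy-vllm-docker.py` script mentioned below) may launch it differently:

```bash
# Serve LFM with the official vLLM OpenAI-compatible server image
# (illustrative flags; adjust to your GPU and the repository's own launch script)
docker run --gpus all --ipc=host -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model LiquidAI/LFM2-8B-A1B \
  --dtype bfloat16 \
  --max-model-len 32768
```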
Option 2. Use vLLM PyPI package
Alternatively, you can use the vLLM PyPI package to deploy LFM. This approach is based on the Modal example for deploying an OpenAI-compatible LLM service with vLLM, with a few modifications.
Launch command:
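A sketch of the launch command, assuming the script keeps the name used in Modal's published vLLM example (`vllm_inference.py`); use the actual script name from this repository:

```bash
# Deploy the Modal app that wraps the vLLM OpenAI-compatible server
# (script name is an assumption based on Modal's example)
modal deploy vllm_inference.py
```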
Detailed modifications:
- Change the `MODEL_NAME` and `MODEL_REVISION` to the latest LFM model.
  - E.g. for `LFM2-8B-A1B`:
    - `MODEL_NAME = "LiquidAI/LFM2-8B-A1B"`
    - `MODEL_REVISION = "6df6a75822a5779f7bf4a21e765cb77d0383935d"`
  - E.g. for other LFM models, set the corresponding repository name and revision.
- Optionally, turn off `FAST_BOOT`.
- Optionally, add these environment variables: `HF_XET_HIGH_PERFORMANCE=1`, `VLLM_USE_V1=1`, `VLLM_USE_FUSED_MOE_GROUPED_TOPK=0`.
- Optionally, add these launch arguments: `--dtype bfloat16`, `--gpu-memory-utilization 0.6`, `--max-model-len 32768`, `--max-num-seqs 600`, `--compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}'`. A combined sketch of these settings is shown after this list.
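For reference, this is a sketch of how the optional settings above map onto a standalone `vllm serve` invocation (the Modal example applies the same values through its Python config; flag support may vary with your vLLM version):

```bash
# Standalone equivalent of the optional environment variables and launch arguments
# (illustrative only; not the Modal deployment itself)
HF_XET_HIGH_PERFORMANCE=1 \
VLLM_USE_V1=1 \
VLLM_USE_FUSED_MOE_GROUPED_TOPK=0 \
vllm serve LiquidAI/LFM2-8B-A1B \
  --revision 6df6a75822a5779f7bf4a21e765cb77d0383935d \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.6 \
  --max-model-len 32768 \
  --max-num-seqs 600 \
  --compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}'
```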
Production deployment
- Prefer the `deploy-vllm-docker.py` script.
- Since vLLM takes over 2 minutes to cold start, it is recommended to keep a minimum number of warm instances for a production inference server, with `min_containers = 1` and `buffer_containers = 1`. The `buffer_containers` config is necessary because all Modal GPUs are subject to preemption. See the Modal docs for details about cold start performance tuning.
- Warm up the vLLM server after deployment by sending a single request; this warm-up step is already included in the `deploy-vllm-docker.py` script. A sketch of such a request is shown after this list.
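A minimal sketch of such a warm-up request, assuming the standard OpenAI-compatible endpoints exposed by vLLM (`<modal-deployment-url>` is your deployment URL and the model name follows the example above):

```bash
# One tiny request so the first real user does not pay the cold-start cost
curl -s <modal-deployment-url>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "LiquidAI/LFM2-8B-A1B",
        "messages": [{"role": "user", "content": "ping"}],
        "max_tokens": 1
      }'
```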
Test commands
Test the deployed server with the following `curl` commands (replace `<modal-deployment-url>` with your actual deployment URL):
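For example (a sketch assuming the standard OpenAI-compatible routes; adjust the model name to your deployment):

```bash
# Quick health check: list the models served
curl -s <modal-deployment-url>/v1/models

# Simple chat completion
curl -s <modal-deployment-url>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "LiquidAI/LFM2-8B-A1B",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```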