Clone the repository
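The repository URL in the sketch below is a placeholder, not the actual URL; substitute the repository you are deploying from:

```bash
# Clone the deployment repository and enter it
# (<repository-url> and <repository-name> are placeholders)
git clone <repository-url>
cd <repository-name>
```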
Option 1. Use vLLM docker image
You can use the vLLM docker image `vllm/vllm-openai` to deploy LFM. This is the recommended approach for production deployment.

Launch command:
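A minimal sketch of a launch command for this image, assuming a plain single-GPU host and the `LFM2-8B-A1B` checkpoint; the repository's own deployment (e.g. the `deploy-vllm-docker.py` script mentioned below) may launch it differently:

```bash
# Serve LFM with the official vLLM OpenAI-compatible server image
# (illustrative flags; adjust to your GPU and the repository's own launch script)
docker run --gpus all --ipc=host -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model LiquidAI/LFM2-8B-A1B \
  --dtype bfloat16 \
  --max-model-len 32768
```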
Option 2. Use vLLM PyPI package
Alternatively, you can use the vLLM PyPI package to deploy LFM. This approach is based on the Modal example for deploying an OpenAI-compatible LLM service with vLLM, with a few modifications.
Launch command:
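A sketch of the launch command, assuming the script keeps the name used in Modal's published vLLM example (`vllm_inference.py`); use the actual script name from this repository:

```bash
# Deploy the Modal app that wraps the vLLM OpenAI-compatible server
# (script name is an assumption based on Modal's example)
modal deploy vllm_inference.py
```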
Detailed modifications:
- Change the `MODEL_NAME` and `MODEL_REVISION` to the latest LFM model.
  - E.g. for `LFM2-8B-A1B`:
    - `MODEL_NAME = "LiquidAI/LFM2-8B-A1B"`
    - `MODEL_REVISION = "6df6a75822a5779f7bf4a21e765cb77d0383935d"`
  - E.g. for other LFM models, set the corresponding repository name and revision.
- Optionally, turn off `FAST_BOOT`.
- Optionally, add these environment variables: `HF_XET_HIGH_PERFORMANCE=1`, `VLLM_USE_V1=1`, `VLLM_USE_FUSED_MOE_GROUPED_TOPK=0`.
- Optionally, add these launch arguments: `--dtype bfloat16`, `--gpu-memory-utilization 0.6`, `--max-model-len 32768`, `--max-num-seqs 600`, `--compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}'`. A combined sketch of these settings is shown after this list.
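For reference, this is a sketch of how the optional settings above map onto a standalone `vllm serve` invocation (the Modal example applies the same values through its Python config; flag support may vary with your vLLM version):

```bash
# Standalone equivalent of the optional environment variables and launch arguments
# (illustrative only; not the Modal deployment itself)
HF_XET_HIGH_PERFORMANCE=1 \
VLLM_USE_V1=1 \
VLLM_USE_FUSED_MOE_GROUPED_TOPK=0 \
vllm serve LiquidAI/LFM2-8B-A1B \
  --revision 6df6a75822a5779f7bf4a21e765cb77d0383935d \
  --dtype bfloat16 \
  --gpu-memory-utilization 0.6 \
  --max-model-len 32768 \
  --max-num-seqs 600 \
  --compilation-config '{"cudagraph_mode": "FULL_AND_PIECEWISE"}'
```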
Production deployment
- Prefer the `deploy-vllm-docker.py` script.
- Since vLLM takes over 2 minutes to cold start, it is recommended to keep a minimum number of warm instances for a production inference server, with `min_containers = 1` and `buffer_containers = 1`. The `buffer_containers` config is necessary because all Modal GPUs are subject to preemption. See the Modal docs for details about cold start performance tuning.
- Warm up the vLLM server after deployment by sending a single request; this warm-up step is already included in the `deploy-vllm-docker.py` script. A sketch of such a request is shown after this list.
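A minimal sketch of such a warm-up request, assuming the standard OpenAI-compatible endpoints exposed by vLLM (`<modal-deployment-url>` is your deployment URL and the model name follows the example above):

```bash
# One tiny request so the first real user does not pay the cold-start cost
curl -s <modal-deployment-url>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "LiquidAI/LFM2-8B-A1B",
        "messages": [{"role": "user", "content": "ping"}],
        "max_tokens": 1
      }'
```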
Test commands
Test the deployed server with the following `curl` commands (replace `<modal-deployment-url>` with your actual deployment URL):
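For example (a sketch assuming the standard OpenAI-compatible routes; adjust the model name to your deployment):

```bash
# Quick health check: list the models served
curl -s <modal-deployment-url>/v1/models

# Simple chat completion
curl -s <modal-deployment-url>/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "LiquidAI/LFM2-8B-A1B",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```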