Skip to main content
← Back to Vision Models LFM2-VL-3B is Liquid AI’s highest-capacity multimodal model, delivering enhanced visual reasoning and detailed image understanding. Ideal for complex vision tasks requiring deeper comprehension.

Specifications

PropertyValue
Parameters3B
Context Length32K tokens
ArchitectureLFM2-VL (Dense)

Advanced Reasoning

Complex visual logic and analysis

Document Understanding

Detailed document and chart parsing

Multi-Image

Compare and reason across images

Quick Start

Install:
pip install git+https://github.com/huggingface/transformers.git@3c2517727ce28a30f5044e01663ee204deb1cdbe pillow torch
Download & Run:
from transformers import AutoProcessor, AutoModelForImageTextToText
from transformers.image_utils import load_image

model_id = "LiquidAI/LFM2-VL-3B"
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="bfloat16"
)
processor = AutoProcessor.from_pretrained(model_id)

url = "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
image = load_image(url)

conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "What is in this image?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
    tokenize=True,
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=256)
response = processor.batch_decode(outputs, skip_special_tokens=True)[0]
print(response)