VLM & Hybrid Backends
VLM Backend
The VLM backend uses a single Vision-Language Model — Qwen2-VL (7B or 72B parameters) — to perform end-to-end document parsing. One model handles layout detection, text recognition, formula extraction, and table reconstruction simultaneously.
Supported Inference Engines
| Engine | Install Extra | Best For |
|---|---|---|
| vllm-engine | vlm-vllm | Production, high-throughput |
| lmdeploy-engine | vlm-lmdeploy | Deployment-optimized serving |
| transformers-engine | vlm-transformers | Development, debugging |
Usage
vector-feed -p document.pdf --backend vlm \
--vlm-model Qwen/Qwen2-VL-7B-Instruct \
--vlm-engine vllm-engine
Hybrid Backend
The Hybrid backend combines VLM coarse layout detection with expert model refinement:
- VLM pass: Identifies page layout, reading order, and content types at a high level.
- Expert refinement: Pipeline models (table structure, formula verification, OCR) refine specific regions identified by the VLM.
This approach pairs the VLM's page-level context with the precision of specialized models.
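The two-pass flow described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual implementation: `Region`, `refine_regions`, and the `refine_*` functions are hypothetical names, and the refiners are string-tagging stand-ins for real expert models. It assumes the VLM pass yields regions that carry a content type and a reading-order index.

```python
from dataclasses import dataclass, replace
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Region:
    """Hypothetical output unit of the coarse VLM pass."""
    kind: str      # content type, e.g. "text", "table", "formula"
    order: int     # reading-order index assigned by the VLM pass
    content: str   # coarse content recognized by the VLM


# Stand-ins for expert models (assumptions for illustration only).
def refine_table(region: Region) -> Region:
    return replace(region, content=f"[table-refined] {region.content}")


def refine_formula(region: Region) -> Region:
    return replace(region, content=f"[formula-verified] {region.content}")


# Map each content type to the expert that refines it.
REFINERS: Dict[str, Callable[[Region], Region]] = {
    "table": refine_table,
    "formula": refine_formula,
}


def refine_regions(
    regions: List[Region],
    refiners: Dict[str, Callable[[Region], Region]],
) -> List[Region]:
    """Apply the matching expert to each region, in reading order.

    Regions with no registered expert keep the VLM's coarse output.
    """
    ordered = sorted(regions, key=lambda r: r.order)
    return [refiners.get(r.kind, lambda x: x)(r) for r in ordered]


if __name__ == "__main__":
    vlm_output = [
        Region(kind="table", order=1, content="a|b"),
        Region(kind="text", order=0, content="hello"),
    ]
    for region in refine_regions(vlm_output, REFINERS):
        print(region.kind, "->", region.content)
```

Dispatching on the content type keeps the VLM's reading order authoritative while letting each expert touch only the regions it is suited for.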
Model Singleton
Both backends use thread-safe singletons (ModelSingleton / AtomModelSingleton) to prevent redundant model loading across concurrent requests.
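To illustrate the general pattern, here is a minimal sketch of a thread-safe singleton that caches loaded models by key. It is not the source of ModelSingleton or AtomModelSingleton; the class name `SketchModelSingleton`, the `get` method, and the `loader` callback are assumptions made for this example.

```python
import threading
from typing import Any, Callable, Dict, Hashable


class SketchModelSingleton:
    """Hypothetical process-wide cache: one instance, one load per key."""

    _instance = None
    _instance_lock = threading.Lock()

    def __new__(cls):
        # Double-checked locking so only one instance is ever created.
        if cls._instance is None:
            with cls._instance_lock:
                if cls._instance is None:
                    inst = super().__new__(cls)
                    inst._models = {}  # key -> loaded model
                    inst._models_lock = threading.Lock()
                    cls._instance = inst
        return cls._instance

    def get(self, key: Hashable, loader: Callable[[], Any]) -> Any:
        """Return the cached model for `key`, calling `loader` at most once."""
        with self._models_lock:
            if key not in self._models:
                self._models[key] = loader()
            return self._models[key]


if __name__ == "__main__":
    def load_fake_model() -> Dict[str, str]:
        # Stand-in for an expensive model load.
        return {"name": "fake-vlm"}

    a = SketchModelSingleton().get("vlm", load_fake_model)
    b = SketchModelSingleton().get("vlm", load_fake_model)
    print(a is b)
```

Holding a single lock around the lookup and load is the simplest correct choice; it serializes loads of different models, which a per-key lock could avoid at the cost of more bookkeeping.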