VECTOR FEED

Output Formats

middle_json

middle_json is the normalized intermediate representation that sits between model inference and final output. Every backend produces this format, and every output generator consumes it.

Structure

{
  "page_num": 1,
  "page_size": {"width": 595.28, "height": 841.89},
  "layout": [
    {
      "type": "text" | "title" | "image" | "table" | "formula",
      "bbox": [x0, y0, x1, y1],
      "text": "Extracted content...",
      "formula_latex": "\\int_{a}^{b} f(x)\\,dx",
      "table_data": {"rows": [...], "cols": [...]},
      "image_path": "page_1_img_0.png"
    }
  ],
  "metadata": {
    "language": "en",
    "backend": "pipeline",
    "processing_time_ms": 2340
  }
}
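
As a rough illustration of how a consumer might read this structure, the sketch below parses one page and counts its layout blocks by type. The function name count_blocks and the sample bbox values are hypothetical, and it assumes a page is available as a single JSON object with the fields shown above.

```python
import json

# Block types listed in the structure above.
KNOWN_TYPES = {"text", "title", "image", "table", "formula"}


def count_blocks(page: dict) -> dict:
    """Count the layout blocks of one middle_json page, grouped by type."""
    counts = {}
    for block in page.get("layout", []):
        block_type = block.get("type")
        if block_type not in KNOWN_TYPES:
            raise ValueError(f"unexpected block type: {block_type!r}")
        # bbox is documented as [x0, y0, x1, y1], so it must have four entries.
        if len(block.get("bbox", [])) != 4:
            raise ValueError(f"bbox must have four coordinates: {block!r}")
        counts[block_type] = counts.get(block_type, 0) + 1
    return counts


# Hypothetical sample page; the bbox numbers are made up for illustration.
SAMPLE = json.loads("""
{
  "page_num": 1,
  "page_size": {"width": 595.28, "height": 841.89},
  "layout": [
    {"type": "title", "bbox": [72, 72, 523, 96], "text": "Output Formats"},
    {"type": "text", "bbox": [72, 110, 523, 160], "text": "Extracted content..."}
  ],
  "metadata": {"language": "en", "backend": "pipeline", "processing_time_ms": 2340}
}
""")

print(count_blocks(SAMPLE))  # {'title': 1, 'text': 1}
```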

Markdown Output

Standard Mode

  • Headings for detected titles (#, ##, etc.)
  • Text blocks in reading order
  • LaTeX formulas in $$ blocks
  • HTML tables for structured data
  • Image references with alt text
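
The mapping above could be sketched roughly as follows. This is not the actual generator: render_block and render_page are hypothetical names, every title is rendered at level one because heading-level detection is not described here, table_data rows are assumed to be lists of cell values, and the text field is assumed to serve as image alt text.

```python
from html import escape


def render_block(block: dict) -> str:
    """Render one layout block as a Markdown fragment (illustrative only)."""
    kind = block["type"]
    if kind == "title":
        # Assumption: a single heading level; real output may use ##, ### etc.
        return f"# {block['text']}"
    if kind == "text":
        return block["text"]
    if kind == "formula":
        return f"$$\n{block['formula_latex']}\n$$"
    if kind == "table":
        # Assumption: "rows" is a list of rows, each a list of cell values.
        rows = block["table_data"]["rows"]
        body = "".join(
            "<tr>" + "".join(f"<td>{escape(str(cell))}</td>" for cell in row) + "</tr>"
            for row in rows
        )
        return f"<table>{body}</table>"
    if kind == "image":
        # Assumption: the block's text field doubles as alt text.
        alt = block.get("text", "")
        return f"![{alt}]({block['image_path']})"
    raise ValueError(f"unknown block type: {kind!r}")


def render_page(page: dict) -> str:
    """Join rendered blocks in layout order, separated by blank lines."""
    return "\n\n".join(render_block(block) for block in page["layout"])
```

The sketch treats the order of the layout array as the reading order, which is also an assumption.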

NLP-Optimized Mode

  • Continuous text without page breaks
  • Merged truncated paragraphs
  • Stripped headers, footers, and watermarks (Data Rinsing™)
  • Removed page numbers and navigation artifacts
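
To make the paragraph merging concrete, here is a minimal sketch of one possible heuristic; it is not the Data Rinsing™ implementation. It assumes each page has already been reduced to a list of body-text paragraph strings, treats a paragraph that does not end in closing punctuation as truncated, and drops paragraphs that consist only of digits as page numbers. The name merge_pages is hypothetical.

```python
import re

# Characters that are taken to mark a complete paragraph (a heuristic).
SENTENCE_END = (".", "!", "?", ":", '"', ")")


def merge_pages(pages: list[list[str]]) -> list[str]:
    """Merge per-page paragraph lists into one continuous list of paragraphs."""
    merged = []
    for paragraphs in pages:
        for text in paragraphs:
            text = text.strip()
            # Skip empty strings and bare page numbers such as "12".
            if not text or re.fullmatch(r"\d+", text):
                continue
            if merged and not merged[-1].endswith(SENTENCE_END):
                if merged[-1].endswith("-"):
                    # Word hyphenated across the break: join without a space.
                    # This can wrongly fuse genuine compounds such as "well-".
                    merged[-1] = merged[-1][:-1] + text
                else:
                    merged[-1] = merged[-1] + " " + text
            else:
                merged.append(text)
    return merged
```

For example, merge_pages([["The model reads each page and", "1"], ["then merges the result."]]) returns a single paragraph with the page number dropped. Titles and other unpunctuated blocks would need to be filtered out before calling it, since the heuristic would otherwise merge them into the following paragraph.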

Content List

A flat list of every detected content block with its type, bounding box, and extracted text, intended for downstream programmatic processing.
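
A content list of this kind could be derived from middle_json pages along the following lines. The function name to_content_list and the exact output keys are assumptions for illustration; only the three pieces of information named above (type, bounding box, text) are carried over.

```python
def to_content_list(pages: list[dict]) -> list[dict]:
    """Flatten the layout blocks of several middle_json pages into one list."""
    items = []
    for page in pages:
        for block in page.get("layout", []):
            items.append({
                "type": block["type"],
                "bbox": list(block["bbox"]),
                # Blocks such as images may carry no text; default to "".
                "text": block.get("text", ""),
            })
    return items
```

Downstream code can then filter the flat list directly, for example [item for item in to_content_list(pages) if item["type"] == "table"].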