Factory-driven glass extraction. SOTA PDF parsing. Snake SAT matching.
Every extraction is scoped to a factory_id. Normalization rules, Snake matching, and business logic are factory-specific. Factories: VIT, Monce, VIP, Eurovitrage, TGVI.
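A minimal sketch of what factory scoping could look like, assuming a simple in-memory registry keyed by factory_id. The `FactoryProfile` class, the `PROFILES` table, and the specific normalization rules are hypothetical and only illustrate the idea; the real rules are not described here.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


def collapse_spaces(text: str) -> str:
    """Replace any run of whitespace with a single space."""
    return " ".join(text.split())


@dataclass(frozen=True)
class FactoryProfile:
    """Hypothetical per-factory settings; the real schema is not specified."""
    factory_id: str
    normalizers: List[Callable[[str], str]] = field(default_factory=list)

    def normalize(self, text: str) -> str:
        # Apply each factory-specific rule in order.
        for rule in self.normalizers:
            text = rule(text)
        return text


# Illustrative rules only: the factory IDs come from the text, the rules do not.
PROFILES: Dict[str, FactoryProfile] = {
    "VIT": FactoryProfile("VIT", [str.strip, str.upper, collapse_spaces]),
    "TGVI": FactoryProfile("TGVI", [str.strip, collapse_spaces]),
}


def get_profile(factory_id: str) -> FactoryProfile:
    """Look up a factory profile, failing loudly on an unknown ID."""
    try:
        return PROFILES[factory_id]
    except KeyError:
        raise ValueError(f"unknown factory_id: {factory_id!r}") from None
```

Failing on an unknown factory_id, rather than falling back to a default profile, keeps one factory's rules from being silently applied to another factory's documents.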
Complex email threads are split into the email body and its attachments. Context extracted from the body informs attachment extraction, so no information is lost.
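The body/attachment split can be sketched with Python's standard `email` package. This is an illustration under assumptions, not the actual pipeline: `split_email` is a hypothetical helper, and how the body context is later passed to attachment extraction is not shown.

```python
from email import message_from_bytes, policy
from typing import List, Tuple


def split_email(raw: bytes) -> Tuple[str, List[Tuple[str, bytes]]]:
    """Return (body_text, [(filename, payload_bytes), ...]) for a raw message."""
    msg = message_from_bytes(raw, policy=policy.default)

    # Prefer the plain-text body, falling back to HTML if that is all there is.
    body_part = msg.get_body(preferencelist=("plain", "html"))
    body = body_part.get_content() if body_part is not None else ""

    attachments: List[Tuple[str, bytes]] = []
    for part in msg.iter_attachments():
        name = part.get_filename() or "unnamed"
        # decode=True undoes base64 / quoted-printable transfer encoding.
        payload = part.get_payload(decode=True) or b""
        attachments.append((name, payload))
    return body, attachments
```

The returned body text can then travel alongside each attachment, for example as extra context in the extraction request, so details that appear only in the email are still available.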
PyMuPDF handles text extraction and page rendering, pdfplumber detects table structure, and a vision LLM reads scanned pages. The hybrid approach covers text-based, tabular, and scanned PDFs.
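One way to picture the hybrid routing is a per-page decision: pages with almost no extractable text are treated as scans, pages with detected tables go to table extraction, and the rest use plain text. The sketch below is an assumption about how such routing might work; `choose_strategy`, `plan_pdf`, and the 20-character threshold are hypothetical.

```python
from typing import List


def choose_strategy(text: str, table_count: int, min_chars: int = 20) -> str:
    """Pick an extraction strategy for one page (threshold is an assumption)."""
    if len(text.strip()) < min_chars:
        return "vision"   # almost no text layer: likely a scanned page
    if table_count > 0:
        return "table"    # pdfplumber found table structure
    return "text"         # ordinary text page


def plan_pdf(path: str) -> List[str]:
    """Return one strategy per page, using PyMuPDF and pdfplumber."""
    import fitz  # PyMuPDF
    import pdfplumber

    plan: List[str] = []
    doc = fitz.open(path)
    try:
        with pdfplumber.open(path) as pdf:
            for mu_page, pl_page in zip(doc, pdf.pages):
                text = mu_page.get_text()        # text layer via PyMuPDF
                tables = pl_page.find_tables()   # table detection via pdfplumber
                plan.append(choose_strategy(text, len(tables)))
    finally:
        doc.close()
    return plan
```

Pages routed to "vision" would be rendered to an image (for instance with PyMuPDF's `page.get_pixmap()`) before being sent to the vision model.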
Integrates with snake.aws.monce.ai. Matching runs as a 3-tier cascade: Snake exact (3ms) → Fuzzy (1ms) → LLM fallback. Article matching is scoped per factory.
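The cascade can be sketched as three lookups tried in order, stopping at the first hit. This is not the Snake API: an in-memory dictionary stands in for the exact tier, `difflib` from the standard library stands in for the fuzzy tier, and `llm_fallback` is a hypothetical callable. The catalog is assumed to be the one belonging to the current factory, and the article codes are made up.

```python
import difflib
from typing import Callable, Dict, Optional, Tuple


def match_article(
    query: str,
    catalog: Dict[str, str],
    llm_fallback: Callable[[str], Optional[str]],
    cutoff: float = 0.8,
) -> Tuple[Optional[str], str]:
    """Return (article_code, tier) where tier is 'exact', 'fuzzy', or 'llm'."""
    key = query.strip().lower()

    # Tier 1: exact lookup on the normalized label.
    if key in catalog:
        return catalog[key], "exact"

    # Tier 2: closest catalog label above the similarity cutoff.
    close = difflib.get_close_matches(key, list(catalog), n=1, cutoff=cutoff)
    if close:
        return catalog[close[0]], "fuzzy"

    # Tier 3: defer to the (hypothetical) LLM matcher.
    return llm_fallback(query), "llm"


# Usage with a made-up catalog for a single factory.
catalog = {"float 4mm": "ART-001", "laminated 44.2": "ART-002"}
code, tier = match_article("flaot 4mm", catalog, lambda q: None)
```

Returning the tier alongside the result makes it possible to see how often the expensive fallback is reached.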