Kreuzberg: Polyglot Document Intelligence - Extract Any File Format with a Single Rust-Powered Framework

Summary: A high-performance, polyglot document intelligence framework with a Rust core. Extract text, metadata, images, and structured data from 97+ file formats, with bindings available in Rust, Python, Go, Java, and 8 more languages.


Rust Core, Universal Bindings: The extraction engine is written in Rust for performance and memory safety. Pre-built bindings for 10+ languages mean you can integrate Kreuzberg into your tech stack without a native compilation step.

97+ Format Support: From PDFs (including password-protected and multi-layer PDFs) to DOCX, EPUB, images, HTML, Markdown, and legacy formats like RTF, Kreuzberg handles them all with a consistent output schema.

Flexible Output Formats: Extract content as plain text, Markdown, HTML, or structured JSON. Configure what you want: page text, headings, tables, images, metadata, or all of the above. An ExtractionConfig object lets you tune each stage of the pipeline.

MCP Server Included: An official Model Context Protocol server ships out of the box, so document extraction can be exposed as a tool to any MCP-compatible AI agent (Claude, Codex, Gemini, etc.).

Active Development: The project ships frequent releases, and the maintainers respond quickly to both bug reports and feature requests via GitHub Issues.
What makes Kreuzberg stand out is its active and technical GitHub community. Issues aren't just filed: they're triaged, root-caused, and fixed with real technical depth. Here are some highlights from recent discussions.

Issue #917 (bug, 2 comments): MCID content dropped from PDF extraction in markdown/html output

User @kh3rld reported that text inside PDF marked-content blocks (identified by MCIDs, and common in professionally designed PDFs) was silently dropped when using output_format="markdown" or output_format="html", even though output_format="plain" captured it correctly. The investigation revealed two independent root causes: (1) a margin-text classifier that marks recurring text as "page furniture" without a first-occurrence exemption, stripping titles from page 1, and (2) the markdown/html pipeline routing through a more limited extraction path than the plain output. The issue is actively being triaged by the maintainer.

Key takeaway for users: If you're extracting text from complex PDFs (especially those generated from design tools like InDesign), verify your markdown or html output against the plain format.

Issue #937 (bug, 1 comment): ExtractionConfig(cancel_token=...) raises TypeError despite type stub declaring it as kwarg

A user reported that Python's type checker accepted cancel_token as a constructor parameter, but the compiled Rust extension raised a TypeError at runtime. Maintainer @Goldziher confirmed the bug and explained the fix: the pyo3 signature binding was updated to explicitly list cancel_token=None, and the Python stub file was synced accordingly. The fix is already on main and will ship in v5.0.0-rc.1 as part of the alef-generated bindings migration.

Key takeaway for users: If you pass cancel_token to the constructor, be aware of this bug in v4.9.7. The workaround is to set the attribute after construction: cfg.cancel_token = CancellationToken().
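The Issue #937 workaround can be wrapped in a small helper that tries the constructor keyword first and falls back to attribute assignment. This is only a sketch: ExtractionConfig and CancellationToken are the names used in the issue, while build_config and the Stub classes below are hypothetical stand-ins so the example runs without Kreuzberg installed.

```python
class StubToken:
    """Hypothetical stand-in for CancellationToken."""


class StubConfigWithBug:
    """Hypothetical stand-in for a config whose constructor rejects cancel_token."""

    def __init__(self, output_format="plain"):
        self.output_format = output_format
        self.cancel_token = None


class StubConfigFixed:
    """Hypothetical stand-in for a config whose constructor accepts cancel_token."""

    def __init__(self, output_format="plain", cancel_token=None):
        self.output_format = output_format
        self.cancel_token = cancel_token


def build_config(config_cls, token_cls, **options):
    """Create a config carrying a cancel token, tolerating the constructor bug."""
    token = token_cls()
    try:
        # Preferred form: pass the token as a constructor keyword.
        config = config_cls(cancel_token=token, **options)
    except TypeError:
        # Fallback described in Issue #937: construct first, then set the attribute.
        config = config_cls(**options)
        config.cancel_token = token
    return config, token


for cls in (StubConfigWithBug, StubConfigFixed):
    config, token = build_config(cls, StubToken, output_format="markdown")
    print(cls.__name__, config.cancel_token is token)
```

With the real library you would presumably pass ExtractionConfig and CancellationToken in place of the stubs; the import path is not given in the issue summary, so check the project documentation. Note that the fallback also triggers on any other TypeError from the constructor, in which case the second call raises the same error again.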
Issue #940 (bug, 1 comment): composer require kreuzberg/kreuzberg fails because type: php-ext routes through PIE

A PHP developer opened a detailed bug report showing that Composer 2.7+ redirects type: php-ext packages to PIE (PHP Installer for Extensions), but PIE runs phpize and looks for a config.m4 file, which a Rust extension doesn't have. Maintainer @Goldziher responded that the root cause was missing release assets, not a packaging mismatch. PIE's pre-packaged-binary workflow looks for platform-specific binaries in the GitHub release, and those weren't yet published for all platforms. The issue has been resolved now that the release assets are in place.

Key takeaway for PHP developers: Make sure you're on the latest release when installing via PIE, since the maintainer is now publishing platform-specific binaries.
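The Issue #917 takeaway (verify markdown or html output against plain output) can be automated with a small text comparison. The sketch below assumes you already have both outputs as strings; the function names are hypothetical and nothing here calls Kreuzberg itself. It compares word sequences only, so formatting characters and HTML tags are ignored.

```python
import re


def normalize(text):
    """Reduce text to a lowercase, space-separated sequence of words."""
    text = re.sub(r"<[^>]+>", " ", text)  # drop HTML tags
    text = text.replace("_", " ")         # underscores are word chars; treat as markup
    return " ".join(re.findall(r"\w+", text.lower()))


def missing_lines(plain_text, rich_text):
    """Return lines of plain_text whose words do not appear in rich_text."""
    # Pad with spaces so matches always fall on whole-word boundaries.
    haystack = f" {normalize(rich_text)} "
    missing = []
    for line in plain_text.splitlines():
        needle = normalize(line)
        if needle and f" {needle} " not in haystack:
            missing.append(line.strip())
    return missing


plain = "Annual Report 2024\nRevenue grew this year.\n"
markdown = "## Overview\n\nRevenue grew **this year**.\n"

for line in missing_lines(plain, markdown):
    print("missing from markdown:", line)
```

Here the title line is reported as missing, which mirrors the page-1 title symptom described in the issue. The check is deliberately crude: text that the richer output legitimately reorders or splits across table cells will also be flagged, so treat hits as prompts for manual review.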
Kreuzberg is a document intelligence framework built to solve the "format soup" problem. Its Rust-powered extraction engine is fast and reliable, its multi-language bindings let it fit into most projects, and its active issue tracker shows a maintainer team that cares deeply about correctness. Whether you're processing PDFs for a RAG system, building a document parser for an AI agent, or just tired of juggling five different libraries for five different file types, Kreuzberg deserves a spot in your stack.

The project is completely open source and welcomes contributions. If you hit a bug, the community is responsive. If you have a feature request, open an issue and it will likely get a thoughtful response from the core team.

@kreuzberg-dev · https://github.com/kreuzberg-dev/kreuzberg · ⭐ 8,304

Kreuzberg is a high-performance, polyglot document intelligence framework with a fast Rust core. It is built for developers who need to extract text, metadata, images, and structured data from PDFs, Office documents, scanned images, and more, covering 97+ file formats in total. Kreuzberg delivers all of this through a unified API available in Rust, Python, Ruby, Java, Go, PHP, Elixir, C#, R, C, and TypeScript.

Whether you're building a RAG pipeline, processing legal documents, or powering an AI agent's file understanding, Kreuzberg abstracts away the complexity of parsing dozens of formats behind a single, clean interface. The project is actively maintained with recent pushes nearly every day, and its GitHub community is buzzing with feature requests, bug reports, and creative integrations.