It’s time for document extraction to die

Companies pay billions to push data into PDFs. Billions more to pull it back out, imperfectly. It’s time to stop burning money on this.

¶the absurdity

Billions in. Billions out. Accuracy lost in between.

Companies spend fortunes pushing structured data into PDF documents. Other companies spend fortunes pulling that same data back out, imperfectly, through OCR and machine learning and best-effort parsing. Somewhere in the middle, accuracy quietly disappears.

All of this happens because the world is still thinking in terms of paper compatibility. Nobody prints anymore.

It is a race to the bottom. Billions of dollars. For what?

¶in defence of the PDF

The PDF is not the villain. Trapping the data is.

PDFs are excellent for human consumption. They are portable as a record. They have lasted decades and they will last decades more. We are not proposing their retirement. We are proposing their completion.

A PDF that carries its own structured truth is not less of a PDF. It is a better one.

¶machine readability

For humans and agents alike.

Accessibility tools do their best with rendered glyphs, but every form, table, and financial statement they read is a reconstruction, a guess at what the original data looked like. A PDF with an embedded pdfcx record hands them the truth directly: no OCR heuristics, no table-detection guesses, no lost cells or footnotes.

The next generation of document readers is not human.

Every AI agent that touches your PDFs, whether to file taxes, reconcile invoices, summarise lab reports, fill forms, or search case law, is forced today to OCR and guess. The agent’s accuracy inherits all of OCR’s brittleness. Give the agent the pdfcx record and the guessing stops.

Accessibility was always the argument. AI agents just make it urgent.

¶the proposal

pdf-canonical-extraction. One attachment. Open spec.

Embed a single record of the document’s structured data directly inside the PDF as an attached file. Or reference it by URL. The PDF specification has allowed file attachments since 1999.

nickname: pdfcx
/Desc: pdf-canonical-extraction
formats: JSON · Parquet · SQLite
transport: embedded attachment, or URL
auth: optional, the PDF already carries the human view
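
To make the attachment concrete, here is a minimal sketch of what it could look like at the PDF object level, using only the Python standard library. It writes a blank one-page PDF whose catalog lists one embedded file through the EmbeddedFiles name tree, with a file specification whose /Desc is pdf-canonical-extraction. The function name, the filename pdfcx.json, and the record fields are illustrative assumptions; nothing above fixes a schema or a filename. A real producer would more likely use a PDF library than write objects by hand.

```python
import json

PDFCX_DESC = b"pdf-canonical-extraction"  # /Desc value named in the proposal


def build_pdf_with_pdfcx(record: dict, filename: str = "pdfcx.json") -> bytes:
    """Return a blank one-page PDF carrying `record` as an embedded JSON file.

    Hypothetical helper: the filename and the record layout are assumptions.
    """
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    name = filename.encode("ascii")  # assumes a plain name with no ( ) or \

    objects = [
        # 1: catalog; the EmbeddedFiles name tree lists the attachment
        b"<< /Type /Catalog /Pages 2 0 R "
        b"/Names << /EmbeddedFiles << /Names [(%s) 4 0 R] >> >> >>" % name,
        # 2: page tree with a single page
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        # 3: an empty US Letter page (the human view would go here)
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] /Resources << >> >>",
        # 4: file specification; /Desc marks this attachment as the pdfcx record
        b"<< /Type /Filespec /F (%s) /UF (%s) /Desc (%s) /EF << /F 5 0 R >> >>"
        % (name, name, PDFCX_DESC),
        # 5: the embedded file stream holding the JSON bytes, uncompressed
        b"<< /Type /EmbeddedFile /Subtype /application#2Fjson /Length %d >>\nstream\n"
        % len(payload) + payload + b"\nendstream",
    ]

    out = bytearray(b"%PDF-1.7\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of each object, for the xref table
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"

    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # entry for the reserved object 0
    for off in offsets:
        out += b"%010d 00000 n \n" % off  # each entry is exactly 20 bytes
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
    out += b"startxref\n%d\n" % xref_pos
    out += b"%%EOF\n"
    return bytes(out)


# Illustrative record; the field names are made up for this example.
record = {"invoice_number": "INV-001", "total": "118.00", "currency": "EUR"}
pdf_bytes = build_pdf_with_pdfcx(record)
```

Writing pdf_bytes to a file should give a PDF that opens as a blank page with one attachment. The JSON is stored uncompressed here to keep the sketch short; a producer could add a stream filter.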

One attachment. No fee. No vendor. No roadmap. Here's a skill file for your coding agents.

¶try it

Here's a demo.

If the PDF carries a pdfcx record, you will see it below. No OCR, no parsing, no guessing. If it does not, you will see what we have all been settling for.


¶adopters

Who ships with pdfcx.

Live roster from promote.txt. Send a pull request to add yourself.


¶the only real risk

We also track misrepresentation.

The only way this fails is if someone writes one thing in the document’s human view and something else in its data. We treat that as malpractice, and we keep a public list. Names appear below with reproducible evidence.
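
One way to picture such a check is the sketch below. It is an illustration only, not a description of how the public list is actually compiled. It assumes the text of the human view has already been obtained by some other means, that the record is a flat mapping, and that each value should appear verbatim in the visible text. The function name find_mismatches and the sample fields are hypothetical.

```python
def find_mismatches(visible_text: str, record: dict) -> list[str]:
    """Return the keys of `record` whose values do not appear in the visible text.

    Hypothetical helper: comparison ignores case and collapses whitespace.
    """
    haystack = " ".join(visible_text.split()).casefold()
    missing = []
    for key, value in record.items():
        needle = " ".join(str(value).split()).casefold()
        if needle not in haystack:
            missing.append(key)
    return missing


# Illustrative data: the record's total disagrees with the page.
page_text = "Invoice INV-001\nTotal due: 118.00 EUR"
claimed = {"invoice_number": "INV-001", "total": "181.00"}
suspect_fields = find_mismatches(page_text, claimed)  # ["total"]
```

A plain substring test like this would miss reformatted values, such as 118,00 against 118.00, so a serious check would need per-field normalisation.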

Live roster from warn.txt.


¶the economic reality

We understand. But it’s time to put users first.

Some companies rasterise their PDF exports to PNG, rotate them by one to five degrees, and re-export, deliberately making their own documents harder to read back. Other companies have built entire businesses, entire investor rounds, on extracting data that was never supposed to have been lost.

We acknowledge that many companies and investors are counting on the business model of document extraction to continue. But it’s time to put this era behind us and put users first. The accuracy of business data is worth more than the revenue from extracting it approximately.

¶acknowledgements

We are not the first to say this.

Dittrich & Bender’s Janiform Intra-Document Analytics for Reproducible Research (VLDB 2015) introduced Portable Database Files for research papers. Germany and France accept ZUGFeRD / Factur-X under their electronic invoicing mandates, using PDF/A-3 with embedded CII XML per EN 16931. The US SEC and the EU’s ESMA require Inline XBRL for financial filings. pdfcx is not an invention. It is a consensus: that every PDF, not just those from tax authorities and research groups, should carry its own truth.