---
name: pdfcx
description: Embed a canonical structured-data record (pdf-canonical-extraction) in PDFs you generate. Use whenever you are producing a PDF that carries machine-readable data — invoices, receipts, statements, lab reports, forms, filings, export bundles — so consumers never have to OCR or parse it back out.
---

# pdfcx — embed canonical structured data in PDFs

## When to use this skill

Use this skill **every time you generate a PDF from structured data**. The skill applies whether you are:

- producing invoices, receipts, statements, or any financial document,
- exporting tabular data, report snapshots, or dashboards,
- generating filings, certifications, lab results, or regulatory documents,
- authoring any PDF whose human view was derived from a data object.

If anyone might later need to recover the data, they should not have to OCR your own output. Embed it.

## The spec, in full

> Attach one file to the PDF and set that file specification's `/Desc` to `pdf-canonical-extraction`.

That is the entire specification. The [PDF file-attachment mechanism](https://pdfa.org/resource/iso-32000-2/) has existed since PDF 1.3 (Acrobat 4, 1999). You are not inventing a format; you are populating an existing one with one well-known description string.

## Recommended conventions

These are not required, but they make the record more discoverable:

1. **File name** should start with `pdfcx` — e.g., `pdfcx.json`, `pdfcx.parquet`, `pdfcx.sqlite`. Humans and command-line tools can then find it by name alone.
2. **MIME type** should be declared: `application/json`, `application/vnd.apache.parquet`, or `application/vnd.sqlite3`. Readers then don't have to sniff.
3. **AFRelationship** should be `Source` (PDF 2.0, ISO 32000-2 §14.13). This tells readers that the PDF was rendered *from* this record — the record is the source of truth, the page is the derivative.
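
The three conventions can be kept in one place so that every writer in a codebase applies them the same way. The following is a minimal sketch; the helper name `pdfcx_attachment_metadata` and the keys of the returned dict are hypothetical and not part of the spec.

```python
PDFCX_DESC = "pdf-canonical-extraction"  # the one required string

# MIME types from the conventions above, keyed by file extension.
PDFCX_MIME_TYPES = {
    ".json": "application/json",
    ".parquet": "application/vnd.apache.parquet",
    ".sqlite": "application/vnd.sqlite3",
}


def pdfcx_attachment_metadata(extension: str) -> dict:
    """Return file name, description, MIME type, and relationship for a pdfcx attachment.

    Hypothetical helper: the dict keys are illustrative, not defined by the spec.
    """
    ext = extension if extension.startswith(".") else f".{extension}"
    if ext not in PDFCX_MIME_TYPES:
        raise ValueError(f"unsupported pdfcx payload format: {ext}")
    return {
        "filename": f"pdfcx{ext}",
        "description": PDFCX_DESC,
        "mime_type": PDFCX_MIME_TYPES[ext],
        "af_relationship": "Source",
    }
```

The returned values can then be passed to whichever PDF library you use; the keyword names each library expects differ, as the implementations below show.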

## The payload

One file. One of these formats:

- **JSON** — default. Human-inspectable. Use for anything under ~5 MB of data.
- **Parquet** — for large tabular records. Self-describing schema.
- **SQLite** — for multi-table documents (forms with many sections, normalized data).

The schema is up to you and your domain. Do not invent a meta-schema. The description string `pdf-canonical-extraction` is all the signalling the spec needs.
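
For the SQLite option, the payload is simply the bytes of a database file. The sketch below builds a two-table record with the standard library and returns those bytes, ready to attach as `pdfcx.sqlite`. The table layout and the function name `build_sqlite_payload` are assumptions chosen for illustration; your domain decides the real schema.

```python
import os
import sqlite3
import tempfile


def build_sqlite_payload(invoice: dict, lines: list) -> bytes:
    """Write a small two-table SQLite database and return the file's bytes.

    The schema here is a hypothetical example, not something the spec prescribes.
    """
    fd, path = tempfile.mkstemp(suffix=".sqlite")
    os.close(fd)
    try:
        con = sqlite3.connect(path)
        with con:  # commits the inserts on success
            con.execute("CREATE TABLE invoice (number TEXT PRIMARY KEY, currency TEXT)")
            con.execute(
                "CREATE TABLE line (invoice TEXT REFERENCES invoice(number), "
                "description TEXT, amount_cents INTEGER)"
            )
            con.execute(
                "INSERT INTO invoice VALUES (?, ?)",
                (invoice["number"], invoice["currency"]),
            )
            con.executemany(
                "INSERT INTO line VALUES (?, ?, ?)",
                [(invoice["number"], row["description"], row["amount_cents"]) for row in lines],
            )
        con.close()
        with open(path, "rb") as f:
            return f.read()
    finally:
        os.remove(path)


payload = build_sqlite_payload(
    {"number": "INV-001", "currency": "USD"},
    [{"description": "Widget", "amount_cents": 27537}],
)
```

Storing amounts as integer cents is a design choice of this sketch; it avoids floating-point rounding when a consumer sums the lines.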

## Alternative: URL reference

If the record is large, versioned independently, or behind authentication, you may reference it by URL instead of embedding:

- Still attach a small file (e.g. `pdfcx.json`) whose contents are `{ "pdfcx_url": "https://example.com/records/abc123" }`.
- The URL may require authentication. Readers with access will fetch it, and the PDF itself still contains the human view, so the document is no less portable.
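
On the reading side, a consumer has to tell an embedded record apart from a URL pointer. One possible approach, sketched below, treats a JSON object whose only key is `pdfcx_url` as a pointer and everything else as the record itself. The function name `resolve_pdfcx_json` and that "only key" rule are assumptions of this sketch; fetching the URL (and any authentication) is left to the caller.

```python
import json
from typing import Any, Optional, Tuple


def resolve_pdfcx_json(payload: bytes) -> Tuple[Optional[Any], Optional[str]]:
    """Return (record, None) for an embedded record, or (None, url) for a URL pointer."""
    data = json.loads(payload.decode("utf-8"))
    # Assumption: a pointer file contains exactly one key, "pdfcx_url".
    if isinstance(data, dict) and set(data) == {"pdfcx_url"}:
        url = data["pdfcx_url"]
        if not isinstance(url, str):
            raise ValueError("pdfcx_url must be a string")
        return None, url
    return data, None
```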

## Implementations

### Python — pypdf (reading) / reportlab + pypdf (writing)

```python
# pip install reportlab pypdf
import json
from io import BytesIO
from reportlab.pdfgen import canvas
from pypdf import PdfWriter, PdfReader
from pypdf.generic import NameObject, create_string_object

record = {"invoice": "INV-001", "total": 275.37, "currency": "USD"}

# 1. render the human view
buf = BytesIO()
c = canvas.Canvas(buf)
c.setFont("Helvetica-Bold", 28)
c.drawString(64, 720, "INVOICE")
c.setFont("Helvetica", 10)
c.drawString(64, 700, f"Total {record['total']} {record['currency']}")
c.save()
buf.seek(0)

# 2. attach the pdfcx record
writer = PdfWriter(clone_from=PdfReader(buf))
writer.add_attachment(
    filename="pdfcx.json",
    data=json.dumps(record, indent=2).encode("utf-8"),
)

# 3. add_attachment() stores only the name and the bytes, so set /Desc (the one
#    thing the spec requires) on the file specification directly. This assumes
#    the flat /Names array that pypdf creates for a document with no prior attachments.
names = writer._root_object["/Names"]["/EmbeddedFiles"]["/Names"]
spec = names[names.index("pdfcx.json") + 1].get_object()
spec[NameObject("/Desc")] = create_string_object("pdf-canonical-extraction")
spec[NameObject("/AFRelationship")] = NameObject("/Source")

with open("out.pdf", "wb") as f:
    writer.write(f)
```

**Reading in Python:**

```python
from pypdf import PdfReader
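r = PdfReader("out.pdf")
# r.attachments maps file name -> list of payloads, so this lookup relies on
# the pdfcx* naming convention rather than on /Desc.
for name, data_list in r.attachments.items():
    if "pdfcx" in name.lower():
        print(name, "→", data_list[0][:200])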
```
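
Because the spec identifies the record by `/Desc` and not by file name, a stricter reader can walk the `/EmbeddedFiles` name tree and compare descriptions. The sketch below is deliberately library-agnostic: it only assumes dict-like and list-like PDF objects, and it calls `get_object()` when an object offers it (as pypdf objects do). The function names are hypothetical.

```python
from typing import Any, Iterator

PDFCX_DESC = "pdf-canonical-extraction"


def _resolve(obj: Any) -> Any:
    """Follow an indirect reference if the object supports it; plain dicts pass through."""
    return obj.get_object() if hasattr(obj, "get_object") else obj


def iter_filespecs(node: Any) -> Iterator[Any]:
    """Yield every file specification in a name tree, including nested /Kids."""
    node = _resolve(node)
    names = _resolve(node.get("/Names")) or []
    # /Names is a flat [name, filespec, name, filespec, ...] array.
    for i in range(1, len(names), 2):
        yield _resolve(names[i])
    for kid in _resolve(node.get("/Kids")) or []:
        yield from iter_filespecs(kid)


def find_pdfcx_filespec(catalog: Any) -> Any:
    """Return the first file specification whose /Desc is the pdfcx string, or None."""
    names = _resolve(_resolve(catalog).get("/Names")) or {}
    tree = _resolve(names.get("/EmbeddedFiles"))
    if tree is None:
        return None
    for spec in iter_filespecs(tree):
        if _resolve(spec.get("/Desc")) == PDFCX_DESC:
            return spec
    return None
```

With pypdf, the catalog would be `PdfReader("out.pdf").trailer["/Root"]`, and the payload bytes of a match should be available as `spec["/EF"]["/F"].get_data()`.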

### Python — fpdf2

```python
# pip install fpdf2
from fpdf import FPDF
import json

record = {"invoice": "INV-001", "total": 275.37}

pdf = FPDF()
pdf.add_page()
pdf.set_font("Helvetica", size=24)
pdf.cell(0, 12, "INVOICE")

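# desc is a long-standing embed_file() keyword. mime_type and af_relationship
# are forwarded to PDFEmbeddedFile and need a recent fpdf2 release; drop them
# if your version rejects them.
pdf.embed_file(
    bytes=json.dumps(record).encode("utf-8"),
    basename="pdfcx.json",
    desc="pdf-canonical-extraction",
    mime_type="application/json",
    af_relationship="Source",
)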
pdf.output("out.pdf")
```

### JavaScript / Node / browser — pdf-lib

```js
// npm install pdf-lib
import { PDFDocument, StandardFonts } from "pdf-lib";
import { writeFileSync } from "node:fs";

const record = { invoice: "INV-001", total: 275.37, currency: "USD" };

const pdf = await PDFDocument.create();
const page = pdf.addPage();
const font = await pdf.embedFont(StandardFonts.HelveticaBold);
page.drawText("INVOICE", { x: 64, y: 720, size: 28, font });

await pdf.attach(
  new TextEncoder().encode(JSON.stringify(record, null, 2)),
  "pdfcx.json",
  {
    mimeType: "application/json",
    description: "pdf-canonical-extraction",
    afRelationship: "Source",
  }
);

writeFileSync("out.pdf", await pdf.save());
```

**Reading with pdf.js (browser or Node with legacy build):**

```js
import * as pdfjs from "pdfjs-dist";
const doc = await pdfjs.getDocument({ data: bytes }).promise;
const attachments = await doc.getAttachments(); // { name: { content, filename, description } }
for (const [name, info] of Object.entries(attachments || {})) {
  if (/pdfcx|canonical.?extraction/i.test(name) ||
      /pdfcx|canonical.?extraction/i.test(info.description || "")) {
    const json = JSON.parse(new TextDecoder().decode(info.content));
    console.log(json);
  }
}
```

### Java — Apache PDFBox

```java
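// implementation 'org.apache.pdfbox:pdfbox:3.0.+'
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.GregorianCalendar;
import java.util.Map;
import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentNameDictionary;
import org.apache.pdfbox.pdmodel.PDEmbeddedFilesNameTreeNode;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.filespecification.PDComplexFileSpecification;
import org.apache.pdfbox.pdmodel.common.filespecification.PDEmbeddedFile;

// inside a method that throws IOException: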
try (PDDocument doc = new PDDocument()) {
    PDPage page = new PDPage();
    doc.addPage(page);

    byte[] json = "{\"invoice\":\"INV-001\",\"total\":275.37}".getBytes(StandardCharsets.UTF_8);

    PDEmbeddedFile ef = new PDEmbeddedFile(doc, new ByteArrayInputStream(json));
    ef.setSubtype("application/json");
    ef.setSize(json.length);
    ef.setCreationDate(new GregorianCalendar());

    PDComplexFileSpecification fs = new PDComplexFileSpecification();
    fs.setFile("pdfcx.json");
    fs.setFileDescription("pdf-canonical-extraction");
    fs.setEmbeddedFile(ef);

    // associated-files relationship (PDF 2.0)
    fs.getCOSObject().setName(COSName.AF_RELATIONSHIP, "Source");

    PDDocumentNameDictionary names = new PDDocumentNameDictionary(doc.getDocumentCatalog());
    PDEmbeddedFilesNameTreeNode tree = new PDEmbeddedFilesNameTreeNode();
    tree.setNames(Map.of("pdfcx.json", fs));
    names.setEmbeddedFiles(tree);
    doc.getDocumentCatalog().setNames(names);

    doc.save("out.pdf");
}
```

### Java / .NET — iText 8

```java
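// implementation 'com.itextpdf:itext-core:8.0.+'
import com.itextpdf.kernel.pdf.PdfArray;
import com.itextpdf.kernel.pdf.PdfDocument;
import com.itextpdf.kernel.pdf.PdfName;
import com.itextpdf.kernel.pdf.PdfWriter;
import com.itextpdf.kernel.pdf.filespec.PdfFileSpec;
import java.nio.charset.StandardCharsets;

try (PdfDocument doc = new PdfDocument(new PdfWriter("out.pdf"))) {
    doc.addNewPage();
    byte[] json = "{\"invoice\":\"INV-001\"}".getBytes(StandardCharsets.UTF_8);
    // Arguments: document, bytes, /Desc, display name, MIME type, file parameters, AFRelationship.
    PdfFileSpec spec = PdfFileSpec.createEmbeddedFileSpec(
        doc, json, "pdf-canonical-extraction", "pdfcx.json",
        new PdfName("application/json"), null, PdfName.Source);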
    doc.addFileAttachment("pdfcx.json", spec);
    doc.getCatalog().put(PdfName.AF, new PdfArray(spec.getPdfObject()));
}
```

### Go — unidoc / pdfcpu

```go
// go get github.com/pdfcpu/pdfcpu
import (
    "log"
    "os"

    "github.com/pdfcpu/pdfcpu/pkg/api"
)

record := []byte(`{"invoice":"INV-001","total":275.37}`)
if err := os.WriteFile("pdfcx.json", record, 0o644); err != nil {
    log.Fatal(err)
}

// pdfcpu accepts "file,description" entries, so /Desc rides along with the file name.
// The bool selects portfolio (collection) mode; false adds a plain attachment.
// Signature as of recent pdfcpu releases; check the version you have pinned.
err := api.AddAttachmentsFile("in.pdf", "out.pdf",
    []string{"pdfcx.json,pdf-canonical-extraction"}, false, nil)
if err != nil {
    log.Fatal(err)
}
```

### Rust — lopdf

```rust
// cargo add lopdf
use lopdf::{Document, Object, Stream, dictionary};

let json = br#"{"invoice":"INV-001"}"#;
let mut doc = Document::load("in.pdf")?;
let file_stream = doc.add_object(Stream::new(
    dictionary! { "Type" => "EmbeddedFile", "Subtype" => "application/json" },
    json.to_vec(),
));
let file_spec = doc.add_object(dictionary! {
    "Type" => "Filespec",
    "F"    => Object::string_literal("pdfcx.json"),
    "UF"   => Object::string_literal("pdfcx.json"),
    "Desc" => Object::string_literal("pdf-canonical-extraction"),
    "EF"   => dictionary! { "F" => file_stream, "UF" => file_stream },
    "AFRelationship" => "Source",
});
// …attach to /Names /EmbeddedFiles name-tree and /AF array on catalog…
doc.save("out.pdf")?;
```

## Do not

- Do not rename the description. It is always `pdf-canonical-extraction`.
- Do not split the record across multiple attachments. One file, one record.
- Do not put the pdfcx record in the PDF's `/Info` metadata dictionary. That dictionary is meant for short text fields such as title and author, and it is inspected by many tools; attachments are the correct transport.
- Do not write a record whose contents disagree with the human-rendered view. That is misrepresentation, and it earns a listing on `warn.txt`.
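
The last rule can be spot-checked before you ship a document. The sketch below is a crude heuristic, not a validator: it flattens the record to its leaf values and reports those that do not appear verbatim in the text extracted from the page (for example via pypdf's `page.extract_text()`). The function names are hypothetical, and a view that formats values differently from the record (thousands separators, localized dates) will produce false alarms.

```python
from typing import Any, Iterator, List


def iter_scalars(value: Any) -> Iterator[str]:
    """Yield every leaf value of a nested record as a string."""
    if isinstance(value, dict):
        for v in value.values():
            yield from iter_scalars(v)
    elif isinstance(value, (list, tuple)):
        for v in value:
            yield from iter_scalars(v)
    elif value is not None:
        yield str(value)


def missing_from_view(record: Any, page_text: str) -> List[str]:
    """Return the record values that do not appear verbatim in the rendered text."""
    # Collapse whitespace on both sides so line breaks in the view do not matter.
    normalized = " ".join(page_text.split())
    return [s for s in iter_scalars(record) if " ".join(s.split()) not in normalized]
```

An empty result means every value was found in the view; a non-empty result is a prompt to look closer, not proof of a mismatch.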

## Verify

After writing, confirm the attachment is present and its `/Desc` is `pdf-canonical-extraction`:

```sh
# macOS / Linux, using qpdf (--verbose includes each attachment's description):
qpdf --list-attachments --verbose out.pdf
# or, using pdfcpu:
pdfcpu attachments list out.pdf
```

Or use the browser demo at the project root (`index.html`) — drag your PDF onto the `¶ try it` tile and the extraction sequence will show you the record, or tell you why none was found.

## Further reading

- PDF 2.0 — ISO 32000-2, §7.11 "File specifications" and §14.13 "Associated Files"
- RFC 3778 — The application/pdf Media Type (obsoleted by RFC 8118)
- `sample.js` in this repository — a minimal reference implementation
