A PDF creation library written in Rust, designed for SaaS and web applications. Low memory and CPU consumption — even for documents with hundreds of pages.
PdfReader allows opening an existing PDF file and inspecting its basic properties. This is the foundation for future features such as field extraction, form filling, and PDF merging.
The initial implementation supports the most common use case: counting the number of pages in a PDF and reading its version string.
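PDF files are structured as:

- A header: the `%PDF-x.y` version declaration
- A body of indirect objects
- A cross-reference (xref) table that maps object numbers to byte offsets
- A trailer dictionary containing `/Root` and `/Size`, followed by `startxref` and `%%EOF`

`PdfReader` parses these in reverse order (the recommended approach per the PDF spec):

1. Locate `startxref` to get the xref table offset
2. Parse the xref table into an object number → byte offset map
3. Read the trailer to get the `/Root` (catalog) reference
4. Resolve the catalog to get the `/Pages` reference
5. Read `/Count` from the pages object

The raw bytes and xref map are retained on the `PdfReader` struct, ready for future object resolution.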
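Locating `startxref`, the first step of the reverse parse, can be sketched as follows. This is a minimal illustration rather than the crate's actual code: the function name `find_startxref` is hypothetical, and the 1024-byte tail window is taken from the `StartxrefNotFound` description in the error table.

```rust
/// Hypothetical helper (not part of pdf_core's API): find the last
/// `startxref` keyword in the file tail and parse the offset after it.
fn find_startxref(data: &[u8]) -> Option<usize> {
    const KEYWORD: &[u8] = b"startxref";

    // Only the last 1024 bytes are searched (assumption based on the
    // StartxrefNotFound description).
    let tail_start = data.len().saturating_sub(1024);
    let tail = &data[tail_start..];

    // Search backwards so the most recent `startxref` wins.
    let pos = tail
        .windows(KEYWORD.len())
        .rposition(|w| w == KEYWORD)?;

    // Skip whitespace after the keyword, then collect the decimal digits.
    let rest = &tail[pos + KEYWORD.len()..];
    let digits: Vec<u8> = rest
        .iter()
        .copied()
        .skip_while(|b| b.is_ascii_whitespace())
        .take_while(|b| b.is_ascii_digit())
        .collect();

    std::str::from_utf8(&digits).ok()?.parse().ok()
}
```

Because the search runs from the end, a file with incremental updates yields the offset written by the latest update.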
```rust
use pdf_core::{PdfReadError, PdfReader};

// From a file
let reader = PdfReader::open("document.pdf")?;

// From bytes (e.g. from a network response or in-memory buffer)
let bytes: Vec<u8> = std::fs::read("document.pdf")?;
let reader = PdfReader::from_bytes(bytes)?;

// Inspect
println!("Pages: {}", reader.page_count()); // e.g. 42
println!("Version: {}", reader.pdf_version()); // e.g. "1.7"
```
```php
// From a file
$reader = PdfReader::open("document.pdf");

// From bytes
$bytes = file_get_contents("document.pdf");
$reader = PdfReader::fromBytes($bytes);

// Inspect
echo $reader->pageCount(); // e.g. 42
echo $reader->pdfVersion(); // e.g. "1.7"
```
`PdfReader::from_bytes()` and `PdfReader::open()` return `Result<PdfReader, PdfReadError>`.
| Error variant | Meaning |
|---|---|
| `NotAPdf` | The data does not begin with `%PDF-` |
| `StartxrefNotFound` | The `startxref` keyword is missing from the last 1024 bytes |
| `MalformedXref` | The xref table cannot be parsed |
| `MalformedTrailer` | The trailer dictionary is missing or lacks `/Root` |
| `XrefStreamNotSupported` | The PDF uses a cross-reference stream (PDF 1.5+); see Limitations |
| `UnresolvableObject(n)` | Object `n` referenced in the xref map cannot be parsed |
| `MalformedPageTree` | The catalog or pages object is missing required entries |
| `Io(msg)` | A file I/O error occurred |
The PDF spec recommends starting from the end of the file because appended incremental updates push new xref tables and trailers toward the end. Starting from startxref ensures the most recent xref table is used.
`PdfReader` holds `data: Vec<u8>` and `xref: HashMap<u32, usize>` even though they are not currently exposed publicly. This is intentional: future issues for field extraction, annotation reading, or page merging will need to resolve arbitrary objects without re-reading the file.
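As an illustration of how such an object number → byte offset map could be built from a classic (non-stream) xref table, here is a hedged sketch. The function name `parse_xref_section` is hypothetical, and for brevity it works on a `&str` starting at the `xref` keyword rather than on raw bytes; the real parser may differ.

```rust
use std::collections::HashMap;

/// Hypothetical sketch: parse a classic xref section into an
/// object number -> byte offset map. Free ("f") entries are skipped.
fn parse_xref_section(section: &str) -> Option<HashMap<u32, usize>> {
    let mut lines = section.lines().map(str::trim).filter(|l| !l.is_empty());
    if lines.next()? != "xref" {
        return None;
    }

    let mut map: HashMap<u32, usize> = HashMap::new();
    let mut next_obj: u32 = 0; // object number of the next entry
    let mut remaining: u32 = 0; // entries left in the current subsection

    for line in lines {
        if line.starts_with("trailer") {
            break;
        }
        let fields: Vec<&str> = line.split_whitespace().collect();
        match fields.as_slice() {
            // Subsection header: "<first object number> <entry count>"
            [first, count] if remaining == 0 => {
                next_obj = first.parse().ok()?;
                remaining = count.parse().ok()?;
            }
            // Entry: "<offset> <generation> <n|f>"
            [offset, _generation, kind] if remaining > 0 => {
                if *kind == "n" {
                    map.insert(next_obj, offset.parse::<usize>().ok()?);
                }
                next_obj += 1;
                remaining -= 1;
            }
            // Anything else is treated as a malformed table.
            _ => return None,
        }
    }
    Some(map)
}
```

Returning `None` on any unexpected line mirrors the idea behind `MalformedXref`, although the actual error mapping is not shown here.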
The minimal dictionary parser extracts only name → first-token pairs. For indirect references (N G R), only the object number N is stored. This is sufficient for following the Catalog → Pages → Count chain. Nested dictionaries and arrays are skipped without error.
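The name → first-token idea can be illustrated with a small sketch. It is a simplification under stated assumptions: `parse_flat_dict` is a hypothetical name, the input is a `&str` rather than bytes, string and comment syntax are ignored, and nesting is tracked with a single depth counter instead of the crate's `skip_nested_dict`.

```rust
use std::collections::HashMap;

/// Hypothetical sketch: map each top-level `/Name` key to the first token
/// of its value. For `N G R` the stored token is the object number N.
/// Keys whose value is a nested dictionary or array are skipped.
fn parse_flat_dict(src: &str) -> HashMap<String, String> {
    // Pad delimiters so a whitespace split yields them as separate tokens.
    let padded = src
        .replace("<<", " << ")
        .replace(">>", " >> ")
        .replace('[', " [ ")
        .replace(']', " ] ")
        .replace('/', " /");

    let mut out = HashMap::new();
    let mut depth = 0i32; // nesting depth of << >> and [ ]
    let mut pending_key: Option<String> = None;

    for tok in padded.split_whitespace() {
        match tok {
            "<<" | "[" => {
                depth += 1;
                if depth > 1 {
                    // The value is a nested container: drop the key.
                    pending_key = None;
                }
            }
            ">>" | "]" => depth -= 1,
            // Tokens inside nested structures are ignored.
            _ if depth != 1 => {}
            _ => {
                if let Some(key) = pending_key.take() {
                    // First token of the value (leading '/' stripped for names).
                    out.insert(key, tok.trim_start_matches('/').to_string());
                } else if let Some(name) = tok.strip_prefix('/') {
                    pending_key = Some(name.to_string());
                }
                // Remaining value tokens (e.g. G and R) fall through unused.
            }
        }
    }
    out
}
```

For a catalog such as `<< /Type /Catalog /Pages 2 0 R >>`, this yields `Type → Catalog` and `Pages → 2`, which is all that is needed to follow the chain to `/Count`.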
The parser is implemented in pure Rust with no additional dependencies. A full-featured PDF parsing crate (e.g., lopdf) was considered but would significantly increase the dependency footprint. The focused implementation is adequate for the current and anticipated near-term requirements.
- PDFs that use a cross-reference stream (PDF 1.5+) are rejected with `PdfReadError::XrefStreamNotSupported`. Many PDFs from Adobe Acrobat and LibreOffice use this format. Support is planned as a future issue.
- […] `MalformedPageTree` or similar.
- Only the most recent xref table (the one located via `startxref`) is used. Earlier versions of an incrementally updated PDF are ignored, which is the correct behavior for reading the current document state.

Issue 27 added three `pub(crate)` methods used by the PDF Merge feature (Issue 28). These are not part of the public API.
`page_object_numbers() -> Result<Vec<u32>, PdfReadError>`

Walks the page tree (Catalog → Pages → Kids, recursively) and returns the object number of every leaf `/Page` node in document order.
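The document-order walk can be sketched over a simplified in-memory model. Everything here is illustrative: the `Node` enum and `leaf_page_numbers` are hypothetical stand-ins for objects the real method would parse from the file, and the cycle guard is an added safety assumption.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical model of a page tree node.
enum Node {
    /// Intermediate /Pages node with its /Kids object numbers.
    Pages(Vec<u32>),
    /// Leaf /Page node.
    Page,
}

/// Depth-first walk that returns leaf page object numbers in document order.
/// Returns None if a referenced object is missing from the model.
fn leaf_page_numbers(objects: &HashMap<u32, Node>, root: u32) -> Option<Vec<u32>> {
    let mut leaves = Vec::new();
    let mut seen = HashSet::new(); // guards against cyclic /Kids references
    let mut stack = vec![root];

    while let Some(num) = stack.pop() {
        if !seen.insert(num) {
            continue;
        }
        match objects.get(&num)? {
            Node::Page => leaves.push(num),
            // Push kids in reverse so the first kid is visited first.
            Node::Pages(kids) => stack.extend(kids.iter().rev().copied()),
        }
    }
    Some(leaves)
}
```

An explicit stack is used instead of recursion so that deeply nested trees cannot overflow the call stack; the visiting order is the same as a recursive pre-order walk.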
`collect_closure(roots: &[u32]) -> Result<HashSet<u32>, PdfReadError>`

Performs a breadth-first search from a seed set of object numbers through all indirect references (`N G R`) reachable from those objects. Returns the complete transitive closure: every object a page depends on (content streams, fonts, images, resource dicts, etc.).
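A breadth-first closure over `N G R` references might look like the sketch below. The names `indirect_refs` and `closure` are hypothetical, object bodies are modelled as strings in a map, and referenced objects that are missing from the map are kept in the result without being expanded, which is an assumption rather than documented behavior.

```rust
use std::collections::{HashMap, HashSet, VecDeque};

/// Scan a body for `N G R` triplets and return each object number N.
fn indirect_refs(body: &str) -> Vec<u32> {
    let tokens: Vec<&str> = body
        .split(|c: char| c.is_whitespace() || "[]<>/()".contains(c))
        .filter(|t| !t.is_empty())
        .collect();
    tokens
        .windows(3)
        .filter(|w| w[2] == "R" && w[1].parse::<u16>().is_ok())
        .filter_map(|w| w[0].parse::<u32>().ok())
        .collect()
}

/// Breadth-first search from `roots` through all indirect references.
fn closure(objects: &HashMap<u32, &str>, roots: &[u32]) -> HashSet<u32> {
    let mut seen: HashSet<u32> = roots.iter().copied().collect();
    let mut queue: VecDeque<u32> = roots.iter().copied().collect();

    while let Some(num) = queue.pop_front() {
        // Objects absent from the map are not expanded (assumption).
        if let Some(body) = objects.get(&num) {
            for r in indirect_refs(body) {
                // `insert` returns true only the first time r is seen.
                if seen.insert(r) {
                    queue.push_back(r);
                }
            }
        }
    }
    seen
}
```

Note that a page's `/Parent` reference would pull the parent `/Pages` node, and through it every sibling page, into the closure; how the real implementation treats `/Parent` is not stated here.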
`raw_object_bytes(obj_num: u32) -> Result<&[u8], PdfReadError>`

Returns a zero-copy slice of the raw bytes for a single object, from `N G obj` through (and including) `endobj`. Used by `collect_closure` and the merge function to scan for references and to copy object data verbatim.
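The zero-copy slicing can be illustrated as follows. This is a sketch under assumptions: `object_slice` is a hypothetical free function that takes the byte offset directly (the real method would look it up in the xref map), and it stops at the first `endobj` it finds, which could be too early if a binary stream happens to contain that byte sequence.

```rust
/// Hypothetical sketch: borrow the bytes of one object, starting at the
/// offset of `N G obj` and ending just after the next `endobj`.
fn object_slice(data: &[u8], offset: usize) -> Option<&[u8]> {
    const END: &[u8] = b"endobj";

    // `get` returns None instead of panicking on an out-of-range offset.
    let rest = data.get(offset..)?;

    // Position of the first `endobj` relative to the object start.
    let end = rest.windows(END.len()).position(|w| w == END)?;

    // The returned slice borrows from `data`; nothing is copied.
    Some(&rest[..end + END.len()])
}
```

Because the result borrows from the original buffer, it stays valid only as long as that buffer does, which matches the `&[u8]` return type in the signature above.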
- `skip_nested_dict` depth tracking: The parser uses an `i32` depth counter that decrements and then checks (rather than checking and then decrementing), so the outer closing `>>` is consumed correctly without also consuming the enclosing structure's `>>`.
- `resolve_kids` is bounded: The search for `/Kids` is restricted to the current object's bytes (up to its `endobj`) so that it cannot match `/Kids` from a later object in the file.
- `extract_indirect_refs` tokenizer: Scans for tokens separated by whitespace or delimiters and looks for `N G R` triplets. It may report false positives from binary streams, which is harmless for closure computation (extra objects in the closure are simply orphaned during merge).

Initial implementation: `PdfReader::open()`, `PdfReader::from_bytes()`, `page_count()`, `pdf_version()`. PHP bindings via `PdfReader::open()` and `PdfReader::fromBytes()`.

Issue 27: `pub(crate)` infrastructure for PDF merging: `page_object_numbers()`, `collect_closure()`, `raw_object_bytes()`. Fixed a `skip_nested_dict` depth-tracking bug (it checked the depth before decrementing, causing the enclosing dict's closing `>>` to be consumed). Added a bounded search in `resolve_kids`.