PDF to Text
Extract text content from PDF documents with word and character counts.
Reviewed by Aygul Dovletova · Last reviewed
How to Extract Text from a PDF
- Drop a PDF on the upload zone or click the dashed box to browse. The file is loaded into an ArrayBuffer and handed to PDF.js.
- Wait a moment. The extractor walks every page and pulls text items from each content stream; for a typical 20-page business document this takes well under a second.
- Read the output in the right-hand textarea. Pages are separated by a blank line and a "--- Page N ---" marker so you can tell where each page begins.
- Copy or download. Use the clipboard button for quick reuse in another tab, or the download button to save a .txt file encoded as UTF-8 that opens cleanly in any editor.
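The output format described in the steps above can be sketched in a few lines of Python. This is an illustration only, not the tool's own code (which runs in the browser on PDF.js): the function names format_pages and count_stats are hypothetical, and the exact whitespace around the marker is an assumption.

```python
def format_pages(pages: list[str]) -> str:
    """Join per-page text with a blank line and a '--- Page N ---' marker."""
    blocks = []
    for number, text in enumerate(pages, start=1):
        blocks.append(f"--- Page {number} ---\n{text.strip()}")
    # A blank line separates each page block from the next one.
    return "\n\n".join(blocks)


def count_stats(text: str) -> dict[str, int]:
    """Return word and character counts for a piece of extracted text."""
    # Words are whitespace-separated runs; characters include whitespace.
    return {"words": len(text.split()), "characters": len(text)}
```

Whether the real counters include the page markers or whitespace is not stated on this page, so treat the counting rule here as one reasonable choice.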
How the Text Layer Extraction Works
This tool loads your file with pdfjs-dist, the parser that powers Firefox's built-in PDF viewer. For each page, it calls page.getTextContent(), which walks the content stream and collects the output of every Tj, TJ, ' (single quote), and " (double quote) operator (the PDF text-showing operators defined in ISO 32000-2 clause 9.4). Each resulting text item carries its string, a reference to the font it was painted with, and its transform matrix. The extractor joins those items in reading order, guessing word and line breaks from the gaps between item boxes. This is a text-layer extraction: the text must already exist as encoded characters in the content stream. If the document is a scanned image with no embedded text, there is nothing to extract; you need OCR, which this tool does not perform.
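The gap-based joining step can be illustrated with a small Python sketch. It is a simplified model, not the tool's code: the TextItem class, its fields, and the two thresholds (space_gap, line_tol) are assumptions chosen for the example, and real items would come from PDF.js rather than being built by hand.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TextItem:
    text: str     # decoded string for this item
    x: float      # left edge of the item box
    y: float      # baseline position (PDF y grows upward)
    width: float  # horizontal extent of the item


def join_items(items: list[TextItem], space_gap: float = 1.0, line_tol: float = 2.0) -> str:
    """Join items in the given order, inferring spaces and line breaks from gaps."""
    out: list[str] = []
    prev: Optional[TextItem] = None
    for item in items:
        if prev is not None:
            if abs(item.y - prev.y) > line_tol:
                out.append("\n")  # baseline moved: treat as a new line
            elif item.x - (prev.x + prev.width) > space_gap:
                out.append(" ")   # horizontal gap: treat as a word break
        out.append(item.text)
        prev = item
    return "".join(out)
```

Two items that touch (gap of zero) are glued into one word, which is why text split across kerning adjustments still reads correctly, while a gap wider than the threshold becomes a space.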
When Text Extraction Is the Right Move
- Pulling the body of a contract into a natural-language diff tool to compare two versions word-by-word.
- Feeding the text of a research paper into a reference manager or summarization pipeline.
- Copying a long quote out of a textbook PDF into a document where the selection behavior in your reader is awkward.
- Exporting meeting minutes from a decades-old internal PDF archive into a wiki that only accepts plaintext.
- Running a quick grep across a hundred-page RFC to count occurrences of a specific term before writing a summary.
- Auditing a privacy policy for specific phrases (retention period, data-sharing clauses) that compliance needs to locate quickly.
Why Extraction Sometimes Fails
The most common failure is a scanned PDF: a file that looks like text but is a sequence of image XObjects with no character codes. Extraction returns empty or near-empty output. The fix is OCR: ocrmypdf input.pdf output.pdf adds an invisible text layer under the images, and re-running this tool on the result produces real text. Other gotchas: custom font encodings that do not map glyph codes to Unicode produce garbled output (common in old CAD exports); multi-column pages can come out interleaved because PDF.js reads in stream order; ligatures like "fi" may appear as a single glyph; and tables lose their cell structure and come out as streams of aligned whitespace.
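A quick way to spot the scanned-PDF case is to check how much visible text each page produced. The sketch below is a heuristic under stated assumptions: the function name likely_scanned_pages and the 10-character threshold are hypothetical, and a legitimately blank page will be flagged too.

```python
def likely_scanned_pages(pages: list[str], min_chars: int = 10) -> list[int]:
    """Return 1-based numbers of pages whose extracted text is empty or near-empty."""
    flagged = []
    for number, text in enumerate(pages, start=1):
        # Count only visible characters so stray whitespace does not hide the problem.
        visible = sum(1 for ch in text if not ch.isspace())
        if visible < min_chars:
            flagged.append(number)
    return flagged
```

If most pages are flagged, the file is probably image-only and needs OCR before extraction.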
How Text Lives Inside a PDF
A PDF stores text as a sequence of marking operators inside each page's content stream. The operator Tj paints a string using the currently active font; the operator Tf selects that font; Td and Tm position the text matrix. The string bytes inside Tj are glyph codes, not Unicode; mapping them back to Unicode characters requires a ToUnicode CMap in the font dictionary per ISO 32000-2 clause 9.10. Well-authored PDFs embed that CMap and extraction yields clean Unicode. Poorly authored PDFs omit it or rely on implicit mappings that PDF.js has to guess. PDF/A (ISO 19005) requires every font to be embedded, and its stricter conformance levels (such as PDF/A-1a and PDF/A-2u) also require every character to map to a Unicode code point, precisely so that archival documents can be re-extracted decades later; this is why academic institutions and courts that mandate PDF/A tend to get cleaner extraction results on long-term files.
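The role of the ToUnicode mapping can be shown with a toy decoder. This sketch assumes single-byte glyph codes and a plain dictionary standing in for the CMap; real CMaps can use multi-byte codes and ranges, and the function name decode_glyphs and the sample table are hypothetical.

```python
def decode_glyphs(codes: bytes, to_unicode: dict[int, str]) -> str:
    """Map single-byte glyph codes to Unicode via a ToUnicode-style table."""
    # Codes with no mapping become U+FFFD, which models how garbled output arises.
    return "".join(to_unicode.get(code, "\ufffd") for code in codes)


# Hypothetical subset font: glyph codes 1-4 stand for H, e, l, o.
cmap = {0x01: "H", 0x02: "e", 0x03: "l", 0x04: "o"}
print(decode_glyphs(b"\x01\x02\x03\x03\x04", cmap))  # Hello
```

With an empty table the same bytes decode to five replacement characters, which is the situation an extractor faces when a font omits its CMap and no fallback mapping applies.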
Alternatives to Browser-Side Extraction
For digital-native PDFs, pdftotext input.pdf - from the Poppler suite is the command-line equivalent and handles Unicode cleanly. mutool draw -F txt input.pdf from MuPDF sometimes gets reading order right on multi-column layouts where pdftotext struggles. For scanned documents the canonical pipeline is ocrmypdf --language eng --force-ocr input.pdf output.pdf followed by pdftotext output.pdf -; this embeds a text layer so every future extraction works. Python's pdfminer.six adds programmatic control that pays off for table extraction. The browser tool wins for one-off extraction with no install; for any recurring batch, move to pdftotext or pdfminer.six in a script.
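For recurring batches, the commands above can be wrapped in a short Python script. This is a minimal sketch that assumes pdftotext and ocrmypdf are installed and on PATH; the function names and the ".ocr.pdf" output suffix are hypothetical, and the flags are exactly the ones quoted in the paragraph above.

```python
import subprocess


def ocr_command(src: str, dst: str) -> list[str]:
    """Command that adds a text layer to a scanned PDF."""
    return ["ocrmypdf", "--language", "eng", "--force-ocr", src, dst]


def pdftotext_command(src: str) -> list[str]:
    """Command that writes extracted text to stdout (the trailing '-')."""
    return ["pdftotext", src, "-"]


def extract_text(src: str, scanned: bool = False) -> str:
    """Run the pipeline and return the extracted text."""
    if scanned:
        ocr_out = src + ".ocr.pdf"  # hypothetical naming scheme for the OCR result
        subprocess.run(ocr_command(src, ocr_out), check=True)
        src = ocr_out
    result = subprocess.run(pdftotext_command(src), check=True, capture_output=True)
    return result.stdout.decode("utf-8")
```

Passing the commands as argument lists rather than shell strings avoids quoting problems with file names that contain spaces.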
Frequently Asked Questions
Why is the extracted text empty for my scanned document?
Because a scanned PDF contains images of text, not text characters. There is nothing in the content stream for PDF.js to pull out; the "pages" are just big image XObjects. Run ocrmypdf on the file first to add an invisible, searchable text layer under the images, then extract again. This is a property of how the file was created, not a limitation of this tool.
Does this tool run OCR?
No, and this is a deliberate choice. OCR in the browser requires either a large WebAssembly Tesseract build or a server round-trip, both of which have real tradeoffs. The recommended pipeline for scanned documents is ocrmypdf on your machine, which produces a searchable PDF that any extractor (including this one) then handles correctly. We separate the text-layer extraction and the OCR steps to keep each tool fast and honest about what it is doing.
Is formatting preserved during extraction?
Only at the crudest level. Paragraph breaks roughly survive if the PDF uses vertical whitespace between blocks; columns, tables, bold and italic, and heading hierarchy do not. The output is plain UTF-8 text with page markers inserted at each page boundary. If you need structural extraction (tables as tables, headings as headings), use a layout-aware Python library like pdfplumber or a commercial extractor like ABBYY FineReader.
Does it handle non-Latin scripts like Chinese, Arabic, or Cyrillic?
Yes, as long as the source document embeds proper Unicode mappings for its fonts. Modern PDFs generated by Word, Google Docs, InDesign, or LaTeX produce clean Unicode for any script. Legacy documents that use custom glyph indices without a ToUnicode CMap produce garbled output regardless of language. Right-to-left scripts like Arabic and Hebrew extract in logical character order; bidirectional shaping is a viewer concern, not an extraction concern.
Why does text from a two-column PDF come out interleaved?
PDF.js reads text items in the order they appear in the content stream, which is the order the PDF producer happened to write them in. Some producers write column one start-to-end, then column two; others interleave line by line. The extractor cannot easily guess the column layout without a geometric pass, which would significantly slow the tool for the common single-column case. For critical two-column extraction, use mutool draw -F txt or pdfminer.six with layout analysis enabled.
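To make the idea of a geometric pass concrete, here is a deliberately simple Python sketch that reorders interleaved items for a two-column page. It is not what this tool, mutool, or pdfminer.six does internally: it assumes the column boundary (split_x) is already known, that items are (x, y, text) tuples, and that y grows upward as in PDF coordinates.

```python
def column_order(items: list[tuple[float, float, str]], split_x: float) -> list[str]:
    """Order (x, y, text) items as left column top-to-bottom, then right column."""
    left = [it for it in items if it[0] < split_x]
    right = [it for it in items if it[0] >= split_x]

    def reading_key(it: tuple[float, float, str]) -> tuple[float, float]:
        # Larger y is higher on the page, so sort by descending y, then by x.
        return (-it[1], it[0])

    ordered = sorted(left, key=reading_key) + sorted(right, key=reading_key)
    return [it[2] for it in ordered]
```

Finding split_x automatically is the hard part in practice, and it is one reason layout analysis costs extra time on pages that turn out to be single-column.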
Is the file uploaded to any server?
No. pdfjs-dist is loaded into the tab and runs against the ArrayBuffer of your file entirely on the client. There is no fetch endpoint that receives file content, no backend parser, and no service worker intercept. You can check by disconnecting your network after the page loads; extraction continues to work. The usual analytics script running on the site captures page views, not file bytes.
What are the "--- Page N ---" markers in the output?
They are page-boundary separators inserted by the extractor to help you map extracted text back to the source. Each page runs through getTextContent independently, and the text from one page ends where the marker for the next page begins. If you post-process the output with a script, you can split on the regex /--- Page \d+ ---/ to iterate page by page.
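A minimal Python sketch of that post-processing step follows. The function name split_pages is hypothetical, and the code assumes the markers appear exactly in the "--- Page N ---" form described above.

```python
import re

PAGE_MARKER = re.compile(r"--- Page (\d+) ---")


def split_pages(output: str) -> dict[int, str]:
    """Map page number to page text, given the extractor's combined output."""
    parts = PAGE_MARKER.split(output)
    # With one capture group, split() yields [preamble, num, text, num, text, ...].
    return {int(parts[i]): parts[i + 1].strip() for i in range(1, len(parts) - 1, 2)}
```

Capturing the page number in the pattern keeps it attached to its text, so the result still maps back to the source even if some pages are empty.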
Can I extract text from password-protected PDFs?
Not in this UI. PDF.js can accept a password parameter to getDocument, but this page does not prompt for one, so the extractor refuses encrypted files at load. Decrypt first with the PDF Unlocker (once the encryption module ships) or with qpdf --decrypt --password=YOURPASS input.pdf output.pdf locally, then run extraction on the decrypted copy.
Why do ligatures like "fi" and "fl" come out as a single character?
When a font embeds the ligature as a single glyph with a single glyph code, and the font dictionary does not map that glyph back to the two-character Unicode sequence, the extractor sees what the PDF actually stores: one character. Well-authored modern PDFs include a ToUnicode CMap that expands the single "ﬁ" ligature glyph (U+FB01) to the two letters "fi", and those extract correctly. If your output is full of ligature characters, run it through a post-processor that applies Unicode NFKC normalization, which replaces each ligature code point with its component letters.
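The normalization step needs only the Python standard library. In this sketch the function name expand_ligatures is hypothetical; unicodedata.normalize is the standard call.

```python
import unicodedata


def expand_ligatures(text: str) -> str:
    """Apply NFKC normalization, which replaces ligature code points such as U+FB01."""
    return unicodedata.normalize("NFKC", text)


# U+FB01 is the "fi" ligature and U+FB03 is the "ffi" ligature.
print(expand_ligatures("\ufb01nal o\ufb03ce"))  # final office
```

NFKC also rewrites other compatibility characters (full-width forms and superscript digits, for example), so apply it only where that broader normalization is acceptable.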
How does this compare to pdftotext?
pdftotext from the Poppler project is the battle-tested CLI equivalent and is usually a little better on reading order for complex layouts, because it does a geometric sort pass that PDF.js does not replicate fully. It is also much faster in batch. This browser tool wins for single-file use with zero install and for situations where the file cannot leave the machine. For any recurring or programmatic extraction, call pdftotext from a shell script or Python wrapper.
More PDF Tools
- Image to PDF: Combine multiple JPG and PNG images into a single PDF document.
- PDF Compressor: Compress PDFs with Ghostscript image downsampling. Pick a quality preset. Files auto-deleted after 15 minutes.
- PDF Merge (Server-Side): Merge up to 20 PDFs into a single document on our EU servers using qpdf. Files auto-deleted after 15 minutes. Handles large or password-cleared inputs the in-browser merger cannot.
- PDF Merger: Merge multiple PDF files into a single document with drag-and-drop reordering.
- PDF Page Reorder: Rearrange pages in a PDF document with a visual drag-and-drop interface.
- PDF Password Protect: Add AES-256 password protection to PDF files via qpdf. Files auto-deleted after 15 minutes.