Question 1

Why is the extracted text empty for my scanned document?

Accepted Answer

Because a scanned PDF contains images of text, not text characters. There is nothing in the content stream for PDF.js to pull out; the "pages" are just big image XObjects. Run ocrmypdf on the file first to add an invisible, searchable text layer under the images, then extract again. This is a property of how the file was created, not a limitation of this tool.

Question 2

Does this tool run OCR?

Accepted Answer

No, and this is a deliberate choice. OCR in the browser requires either a large WebAssembly Tesseract build, which adds a multi-megabyte download, or a server round-trip, which sends the file off your machine. The recommended pipeline for scanned documents is to run ocrmypdf on your machine, which produces a searchable PDF that any extractor (including this one) then handles correctly. We keep text-layer extraction and OCR as separate steps so that each tool stays fast and honest about what it is doing.

Question 3

Is formatting preserved during extraction?

Accepted Answer

Only at the crudest level. Paragraph breaks roughly survive if the PDF uses vertical whitespace between blocks; columns, tables, bold and italic, and heading hierarchy do not. The output is plain UTF-8 text with page markers inserted at each page boundary. If you need structural extraction (tables as tables, headings as headings), use a layout-aware Python library like pdfplumber or a commercial extractor like ABBYY FineReader.

Question 4

Does it handle non-Latin scripts like Chinese, Arabic, or Cyrillic?

Accepted Answer

Yes, as long as the source document embeds proper Unicode mappings for its fonts. Modern PDFs generated by Word, Google Docs, InDesign, or LaTeX generally produce clean Unicode for any script. Legacy documents that use custom glyph indices without a ToUnicode CMap produce garbled output regardless of language. Right-to-left scripts like Arabic and Hebrew extract in logical character order; bidirectional reordering and glyph shaping for display are viewer concerns, not extraction concerns.

Question 5

Why does text from a two-column PDF come out interleaved?

Accepted Answer

PDF.js reads text items in the order they appear in the content stream, which is the order the PDF producer happened to write them in. Some producers write column one start-to-end, then column two; others interleave line by line. The extractor cannot easily guess the column layout without a geometric pass, which would significantly slow the tool for the common single-column case. For critical two-column extraction, use mutool draw -F txt or pdfminer.six with layout analysis enabled.
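
To make the idea of a geometric pass concrete, here is a minimal sketch in Python. It is an illustration only, not what this tool or PDF.js does. The TextItem type, the split_x column boundary, and the sample coordinates are all hypothetical, and the sketch assumes exactly two columns with PDF-style coordinates (larger y is higher on the page).

```python
from dataclasses import dataclass


@dataclass
class TextItem:
    """Hypothetical text fragment: a string plus the position of its origin."""
    text: str
    x: float  # horizontal position, measured from the left edge of the page
    y: float  # vertical position, PDF-style: larger values are higher up


def reading_order(item: TextItem) -> tuple[float, float]:
    """Sort key: top to bottom (descending y), then left to right."""
    return (-item.y, item.x)


def reorder_two_columns(items: list[TextItem], split_x: float) -> list[str]:
    """Return the text of `items` with the left column first, then the right.

    Assumes exactly two columns separated at `split_x`. Real pages need
    gutter detection, which is not attempted here.
    """
    left = [it for it in items if it.x < split_x]
    right = [it for it in items if it.x >= split_x]
    ordered = sorted(left, key=reading_order) + sorted(right, key=reading_order)
    return [it.text for it in ordered]


# Items as an interleaving producer might write them: line by line across columns.
interleaved = [
    TextItem("L1", 50, 700), TextItem("R1", 320, 700),
    TextItem("L2", 50, 680), TextItem("R2", 320, 680),
]
print(reorder_two_columns(interleaved, split_x=300))  # ['L1', 'L2', 'R1', 'R2']
```

Even this toy version has to look at every item's position and needs a column boundary supplied by hand, which is why dedicated layout-analysis tools are the better choice for multi-column documents.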

Question 6

Is the file uploaded to any server?

Accepted Answer

No. pdfjs-dist is loaded into the tab and runs against the ArrayBuffer of your file entirely on the client. There is no fetch endpoint that receives file content, no backend parser, and no service worker intercept. You can check by disconnecting your network after the page loads; extraction continues to work. The usual analytics script running on the site captures page views, not file bytes.

Question 7

What are the "--- Page N ---" markers in the output?

Accepted Answer

They are page-boundary separators inserted by the extractor to help you map extracted text back to the source. Each page runs through getTextContent independently, and the text from one page ends where the marker for the next page begins. If you post-process the output with a script, you can split on the regex /--- Page \d+ ---/ to iterate page-by-page.
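
As a rough sketch of that post-processing step, the Python below splits saved output into a page-number-to-text mapping. It assumes each marker precedes the text of its page. The function name split_pages and the sample string are made up for illustration.

```python
import re

# Same pattern as above, with a capture group so the page number is kept.
PAGE_MARKER = re.compile(r"--- Page (\d+) ---")


def split_pages(output: str) -> dict[int, str]:
    """Map each page number to the text that follows that page's marker."""
    # With one capture group, split() returns
    # [text_before_first_marker, "1", text_1, "2", text_2, ...].
    parts = PAGE_MARKER.split(output)
    pages = {}
    for number, text in zip(parts[1::2], parts[2::2]):
        pages[int(number)] = text.strip()
    return pages


sample = "--- Page 1 ---\nHello\n--- Page 2 ---\nWorld\n"
print(split_pages(sample))  # {1: 'Hello', 2: 'World'}
```

Any text that appears before the first marker is ignored by this sketch. Note that body text which happens to contain the same marker pattern would be split as well.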

Question 8

Can I extract text from password-protected PDFs?

Accepted Answer

Not in this UI. PDF.js can accept a password parameter to getDocument, but this page does not prompt for one, so the extractor refuses encrypted files at load. Decrypt first with the PDF Unlocker (once the encryption module ships) or locally with qpdf --decrypt --password=YOURPASS input.pdf output.pdf, then run extraction on the decrypted copy.

Question 9

Why do ligatures like "fi" and "fl" come out as a single character?

Accepted Answer

When a font embeds the ligature as a single glyph with a single glyph code, and the font dictionary does not map that glyph back to the two-character Unicode sequence, the extractor sees what the PDF actually stores: one character. Well-authored modern PDFs include a ToUnicode CMap that expands "ﬁ" to "fi", and those extract correctly. If your output is full of ligature glyphs, run it through a post-processor that normalizes Unicode (NFKC) to decompose them.
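
A minimal version of such a post-processor, using Python's standard unicodedata module, might look like this. The function name expand_ligatures is made up for illustration, and the ligatures in the sample are written as escapes so they stay visible.

```python
import unicodedata


def expand_ligatures(text: str) -> str:
    """Apply NFKC normalization, which replaces compatibility characters
    such as U+FB01 (the fi ligature) with their plain-letter equivalents."""
    return unicodedata.normalize("NFKC", text)


# U+FB03 is the "ffi" ligature and U+FB02 is the "fl" ligature.
print(expand_ligatures("e\ufb03cient work\ufb02ow"))  # efficient workflow
```

NFKC is not ligature-specific: it also folds other compatibility characters, for example superscript digits and full-width letters, into their ordinary forms. If the text relies on those distinctions, replace only the ligature code points instead.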

Question 10

How does this compare to pdftotext?

Accepted Answer

pdftotext from the Poppler project is the battle-tested CLI equivalent and is usually a little better on reading order for complex layouts, because it does a geometric sort pass that PDF.js does not replicate fully. It is also much faster in batch. This browser tool wins for single-file use with zero install and for situations where the file cannot leave the machine. For any recurring or programmatic extraction, call pdftotext from a shell script or Python wrapper.
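
For the Python-wrapper route, a small sketch could look like the following. It assumes pdftotext is installed and on PATH. The function names are hypothetical, and the choice of the -layout flag is one option among several rather than a recommendation from this tool.

```python
import shutil
import subprocess


def pdftotext_command(pdf_path: str, layout: bool = True) -> list[str]:
    """Build the pdftotext argument list for one file."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if layout:
        cmd.append("-layout")  # try to preserve the physical page layout
    cmd += [pdf_path, "-"]  # "-" as the output file sends text to stdout
    return cmd


def extract_text(pdf_path: str, layout: bool = True) -> str:
    """Run pdftotext on `pdf_path` and return the extracted text."""
    if shutil.which("pdftotext") is None:
        raise RuntimeError("pdftotext not found on PATH; install Poppler first")
    result = subprocess.run(
        pdftotext_command(pdf_path, layout),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext exits non-zero
    )
    return result.stdout.decode("utf-8")
```

Calling extract_text("report.pdf") returns the text as a string. For a batch job, loop over the files and call it once per path.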

PDF to Text
