About PDF Text Extraction

This tool uses Mozilla's PDF.js library to parse PDF files and extract the embedded text layer page by page. The extraction runs entirely in your browser - your PDF file is never sent to any server.
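
As a rough illustration of page-by-page extraction, the sketch below walks a loaded PDF.js document and collects one string per page. It is not this tool's actual source: the function name extractTextByPage is made up for the example, the global name pdfjsLib in the usage comment is an assumption about how the library is loaded, and joining text runs with a single space is a simplification.

```javascript
// Sketch only: `pdf` is assumed to be a loaded PDF.js document, i.e. the value
// resolved by `pdfjsLib.getDocument(...).promise`.
async function extractTextByPage(pdf) {
  const pageTexts = [];
  for (let pageNumber = 1; pageNumber <= pdf.numPages; pageNumber++) {
    const page = await pdf.getPage(pageNumber); // PDF.js page numbers start at 1
    const content = await page.getTextContent();
    const text = content.items
      .filter((item) => typeof item.str === "string") // skip non-text items
      .map((item) => item.str)
      .join(" "); // plain space join; real layouts may need smarter spacing
    pageTexts.push(text);
  }
  return pageTexts;
}

// Possible browser usage, where `file` is a File from an <input type="file">:
//   const data = await file.arrayBuffer();
//   const pdf = await pdfjsLib.getDocument({ data }).promise;
//   const pageTexts = await extractTextByPage(pdf);
```

Because the file is read with file.arrayBuffer() and handed straight to the library, nothing in this flow requires a network request.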

Note: scanned PDFs without an embedded OCR text layer will produce empty output. Use an OCR tool first to add a text layer to such documents.
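
One simple way to spot a likely scan is to check which pages produced no text at all. The helper below is a hypothetical sketch (the name findPagesWithoutText and the array-of-strings input are assumptions made for the example), and an empty page is only a hint: a genuinely blank page looks the same.

```javascript
// Returns the 1-based numbers of pages whose extracted text is empty or
// whitespace only. `pageTexts` is assumed to hold one string per page.
function findPagesWithoutText(pageTexts) {
  const emptyPages = [];
  pageTexts.forEach((text, index) => {
    if (text.trim().length === 0) {
      emptyPages.push(index + 1);
    }
  });
  return emptyPages;
}

// Example: pages 2 and 3 yielded nothing, so they may be scanned images.
console.log(findPagesWithoutText(["Hello", "  \n", ""])); // → [ 2, 3 ]
```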

Why text extraction sometimes fails or looks wrong

PDF is a page-description format, not a document format. Text extraction is surprisingly complex because PDFs store characters as positioned glyphs on a canvas, not as a logical reading-order stream. Common issues include:

Multi-column layouts: extracted text may merge columns left-to-right instead of reading each column top-to-bottom. Manual cleanup is usually needed.
Tables: cell content may be extracted row-by-row, column-by-column, or in arbitrary order depending on how the PDF was produced.
Ligatures and special characters: some PDFs encode "fi", "fl", and other ligatures as single glyphs with no Unicode mapping, producing gaps or garbled characters on extraction.
Right-to-left text (Arabic, Hebrew): reading order may be reversed in the extracted output.
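
Two of the issues above can be partly cleaned up after extraction. The sketch below is illustrative only: readTwoColumns and expandLigatures are hypothetical helpers, the { str, x, y } item shape is an assumption (in PDF.js the position of a text run is available as item.transform[4] and item.transform[5]), and the column boundary splitX has to be supplied by the caller. The ligature helper only helps when the PDF maps ligatures to Unicode ligature characters; it cannot recover glyphs that have no Unicode mapping at all.

```javascript
// Reads a two-column page column by column instead of left to right.
// `items` is an assumed shape: [{ str, x, y }], positions in PDF units.
function readTwoColumns(items, splitX) {
  // The PDF y axis points up, so a larger y is higher on the page.
  const topToBottom = (a, b) => b.y - a.y || a.x - b.x;
  const left = items.filter((item) => item.x < splitX).sort(topToBottom);
  const right = items.filter((item) => item.x >= splitX).sort(topToBottom);
  return [...left, ...right].map((item) => item.str).join("\n");
}

// Replaces the five common Latin ligature characters with plain letters.
const LIGATURES = {
  "\uFB00": "ff",
  "\uFB01": "fi",
  "\uFB02": "fl",
  "\uFB03": "ffi",
  "\uFB04": "ffl",
};

function expandLigatures(text) {
  return text.replace(/[\uFB00-\uFB04]/g, (ch) => LIGATURES[ch]);
}
```

A fixed split position works only for simple, regular layouts; pages with full-width headings or a varying number of columns still need manual cleanup.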

OCR for scanned PDFs

Scanned PDFs contain only rasterized page images, so there is no text layer to extract. To extract text from a scanned document, you first need to run Optical Character Recognition (OCR). Options include:

Tesseract: open-source OCR engine; available as a command-line tool and used by other open-source tools such as OCRmyPDF, which adds a text layer to scanned PDFs.
Google Drive: upload the PDF, then right-click it and choose Open with > Google Docs; Drive runs OCR during the conversion, and you can copy the recognized text from the resulting document.
Adobe Acrobat: the paid Acrobat Standard and Pro editions can OCR scanned PDFs and add a text layer, making them searchable and extractable. The free Acrobat Reader does not include OCR.

Output format options

After extraction you can copy the text directly or download it as a plain .txt file. For structured documents, consider pasting into a text editor that supports column (block) selection, or importing into a spreadsheet if the source was a table-heavy PDF.
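
A plain .txt download can be produced in the browser without any server round trip. The sketch below shows one common approach; the helper names, the "--- Page N ---" separator, and the default filename are all assumptions made for the example, not a description of this tool's actual output.

```javascript
// Joins per-page text into one string with a simple page separator.
function joinPages(pageTexts) {
  return (
    pageTexts
      .map((text, index) => `--- Page ${index + 1} ---\n${text.trim()}`)
      .join("\n\n") + "\n"
  );
}

// Browser-only: offers `text` as a file download via a temporary object URL.
function downloadAsTxt(text, filename = "extracted.txt") {
  const blob = new Blob([text], { type: "text/plain;charset=utf-8" });
  const url = URL.createObjectURL(blob);
  const link = document.createElement("a");
  link.href = url;
  link.download = filename;
  link.click();
  // Release the object URL once the click has been dispatched.
  setTimeout(() => URL.revokeObjectURL(url), 0);
}

// Possible usage in a page script:
//   downloadAsTxt(joinPages(pageTexts));
```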