PDF OCR - Extract Text from Scanned PDFs
Extract text from scanned, image-based PDFs using OCR that runs entirely in your browser. Output formats: plain text, Markdown, searchable PDF, or hOCR/JSON for developers. No uploads.
PDF files up to 200 MB
What is PDF OCR?
Optical Character Recognition (OCR) converts scanned or image-based PDF pages into machine-readable text. Unlike PDFs produced by word processors, scanned documents contain only embedded images with no underlying text layer. OCR reconstructs that text layer so you can search, copy, and edit it.
How the OCR engine works
This tool uses Tesseract.js, a WebAssembly port of Google's Tesseract OCR engine. Each PDF page is rendered to a high-resolution canvas (216 DPI equivalent) by PDF.js, then passed to Tesseract for recognition. Tesseract uses an LSTM neural network trained on printed text to identify characters and words. All processing runs in your browser - no page images are ever sent to a server.
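The "216 DPI equivalent" figure follows from the PDF coordinate system, which is defined at 72 points per inch: a render scale of 3 gives 216 pixels per inch. The sketch below works through that arithmetic. The function and type names are hypothetical and are not taken from this tool's source; the resulting scale is the kind of value one would pass to PDF.js when requesting a page viewport.

```typescript
// PDF user space is defined at 72 points per inch.
const PDF_POINTS_PER_INCH = 72;

// Hypothetical types for illustration only.
interface PageSizePt {
  widthPt: number;
  heightPt: number;
}

interface CanvasSizePx {
  scale: number;
  widthPx: number;
  heightPx: number;
}

// Compute the render scale and canvas size for a target DPI.
function canvasSizeForDpi(page: PageSizePt, targetDpi: number = 216): CanvasSizePx {
  const scale = targetDpi / PDF_POINTS_PER_INCH; // 216 / 72 = 3
  return {
    scale,
    // Round up so the canvas never clips the last row or column of pixels.
    widthPx: Math.ceil(page.widthPt * scale),
    heightPx: Math.ceil(page.heightPt * scale),
  };
}

// A US Letter page is 612 x 792 pt.
const letterCanvas = canvasSizeForDpi({ widthPt: 612, heightPt: 792 });
console.log(letterCanvas); // { scale: 3, widthPx: 1836, heightPx: 2376 }
```

A higher scale gives Tesseract more pixels per character at the cost of memory: the Letter page above already needs a canvas of roughly 4.4 million pixels.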
Choosing the right language
Selecting the correct document language is the single biggest factor in OCR accuracy. Tesseract loads a language-specific trained data file (~4 MB) from a CDN on first use. If your document contains a mix of languages, choose the dominant language. For Latin-script languages not in the common list, try the full language dropdown below the common options.
Understanding confidence scores
Each page receives a confidence score (0-100%) indicating how certain Tesseract was about its character recognition. Scores above 80% generally indicate clean, reliable output. Scores below 50% suggest the page may be blurry, contain unusual fonts, or be in a language that does not match your selection. Low-confidence pages are highlighted in amber or red.
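As a rough illustration of how such thresholds might translate into highlighting, the sketch below maps a page score to a colour band. The 80 and 50 cut-offs come from the description above; the split between amber and red is an assumption made for this example, and the function name is hypothetical.

```typescript
type ConfidenceBand = "ok" | "amber" | "red";

// Map a page confidence score (0-100) to a highlight band.
// Assumption: below 50 is red, 50 to 80 is amber, above 80 is not highlighted.
function classifyConfidence(score: number): ConfidenceBand {
  if (!Number.isFinite(score) || score < 0 || score > 100) {
    throw new RangeError(`confidence must be between 0 and 100, got ${score}`);
  }
  if (score < 50) return "red";
  if (score <= 80) return "amber";
  return "ok";
}

console.log(classifyConfidence(92)); // "ok"
console.log(classifyConfidence(65)); // "amber"
console.log(classifyConfidence(31)); // "red"
```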
Searchable PDF output
The Searchable PDF format embeds an invisible text layer at each recognised word's position in the original PDF. The visual appearance of the document is unchanged, but PDF viewers (Acrobat, Chrome, Preview) can now search, highlight, and copy text. Word positions are approximate - they are calculated by mapping OCR canvas pixel coordinates back to PDF point space. Minor misalignment is expected and does not affect searchability.
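The pixel-to-point mapping can be sketched as follows. This is a minimal version that assumes an unrotated page with no crop-box offset, a word box given as top-left-origin pixel corners (x0, y0, x1, y1), and the same render scale that was used to draw the canvas. The type and function names are hypothetical.

```typescript
// Word box in canvas pixels, origin at the top-left corner.
interface PixelBox {
  x0: number;
  y0: number;
  x1: number;
  y1: number;
}

// Rectangle in PDF points, origin at the bottom-left corner.
interface PdfRect {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Convert an OCR pixel box to PDF point space.
// Assumes no page rotation and a page origin at (0, 0).
function pixelBoxToPdfRect(box: PixelBox, scale: number, pageHeightPt: number): PdfRect {
  return {
    x: box.x0 / scale,
    // The canvas y axis points down and the PDF y axis points up,
    // so the bottom edge of the pixel box becomes the rect's y.
    y: pageHeightPt - box.y1 / scale,
    width: (box.x1 - box.x0) / scale,
    height: (box.y1 - box.y0) / scale,
  };
}

// A word at pixels (300,150)-(600,210) on a Letter page rendered at scale 3.
console.log(pixelBoxToPdfRect({ x0: 300, y0: 150, x1: 600, y1: 210 }, 3, 792));
// { x: 100, y: 722, width: 100, height: 20 }
```

Rounding during rendering and recognition is one reason the resulting positions are only approximate.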
Markdown output
Markdown output uses font-size hints from the hOCR data to infer headings: large text becomes # H1 or ## H2 headings, and significant vertical gaps become paragraph breaks. This works well on cleanly structured documents like reports and articles, but complex multi-column layouts or decorative fonts may produce inaccurate heading classification. Always review Markdown output manually.
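One way such a heuristic could work is sketched below. It is not this tool's actual algorithm: the line shape, the size ratios (1.8 and 1.3 times the median font size), and the gap factor are all assumptions chosen for illustration.

```typescript
// Hypothetical line record derived from OCR output.
interface OcrLine {
  text: string;
  fontSize: number; // estimated size, any consistent unit
  top: number;      // y of the line's top edge, increasing downwards
  bottom: number;   // y of the line's bottom edge
}

// Illustrative thresholds, not values taken from the tool.
const H1_RATIO = 1.8;
const H2_RATIO = 1.3;
const GAP_FACTOR = 1.0;

function medianOf(values: number[]): number {
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
}

function linesToMarkdown(lines: OcrLine[]): string {
  if (lines.length === 0) return "";
  // Treat the median font size as the body text size.
  const bodySize = medianOf(lines.map((l) => l.fontSize));
  const blocks: string[] = [];
  let paragraph: string[] = [];
  let prevBottom: number | null = null;

  // Close the paragraph being collected, if any.
  const flush = (): void => {
    if (paragraph.length > 0) {
      blocks.push(paragraph.join(" "));
      paragraph = [];
    }
  };

  for (const line of lines) {
    const ratio = line.fontSize / bodySize;
    const gap = prevBottom === null ? 0 : line.top - prevBottom;
    if (ratio >= H1_RATIO) {
      flush();
      blocks.push(`# ${line.text}`);
    } else if (ratio >= H2_RATIO) {
      flush();
      blocks.push(`## ${line.text}`);
    } else {
      // A large vertical gap starts a new paragraph.
      if (gap > bodySize * GAP_FACTOR) flush();
      paragraph.push(line.text);
    }
    prevBottom = line.bottom;
  }
  flush();
  return blocks.join("\n\n");
}

const sampleLines: OcrLine[] = [
  { text: "Annual Report", fontSize: 24, top: 0, bottom: 24 },
  { text: "Revenue grew this year.", fontSize: 12, top: 40, bottom: 52 },
  { text: "Costs were flat.", fontSize: 12, top: 55, bottom: 67 },
  { text: "Outlook", fontSize: 16, top: 90, bottom: 106 },
  { text: "We expect growth.", fontSize: 12, top: 120, bottom: 132 },
];
console.log(linesToMarkdown(sampleLines));
```

Because the sketch reads lines in a single top-to-bottom order, a multi-column page would interleave its columns, which is the kind of layout where manual review matters most.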
Privacy and offline use
Your PDF is never uploaded. The only network request is the one-time download of the Tesseract language data file for each language you select, fetched from a public CDN. After that download, OCR runs entirely offline in your browser. No account is required, no data is stored, and no analytics are collected on your document content.