Does this work on PDFs that already have a text layer?

Yes, but the PDF Text Extractor tool is faster and more accurate for PDFs that already contain a machine-readable text layer. This OCR tool is designed for image-only scanned documents.

How long does OCR take?

Expect 2-8 seconds per page, depending on page size, image quality, and your device speed. Tesseract runs a neural network inference pass on each rendered page image inside a WebAssembly module.

Is my document uploaded to a server?

No. Your PDF is never sent to any server. All rendering and OCR processing runs locally in your browser. The only network request happens the first time you click Run OCR, when the Tesseract language data file (~4 MB) is downloaded from a CDN.

What languages are supported?

Over 90 languages via Tesseract. Select the document language from the dropdown before running OCR. The common languages group covers the most frequently used options; the full list below it includes less common languages. Selecting the correct language is important for accuracy.

How accurate is the Markdown output?

Markdown formatting is heuristic - it uses font-size hints (x_size) from the hOCR output to infer headings and paragraph breaks. It works well on cleanly structured documents but may misclassify headings on complex layouts. Manual review is always recommended.

What is the Searchable PDF output?

The tool embeds an invisible text layer at each recognised word position using pdf-lib, making the PDF selectable and searchable in any viewer without changing its visual appearance. Word positions are approximate because they are mapped from OCR canvas pixel coordinates to PDF point space.

What do the hOCR and JSON outputs contain?

hOCR is the raw Tesseract HTML output with per-word bounding boxes and confidence scores, useful for further processing. JSON is a structured version of the same data with word text, bounding box coordinates, and confidence values per page.
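As a rough illustration of how the two formats relate, the sketch below reads the bounding box and confidence from the `title` attribute that Tesseract writes on each hOCR word (for example `bbox 36 92 96 116; x_wconf 96`) and turns it into a JSON-style record. The `OcrWord` shape and the `parseHocrTitle` and `toWord` helpers are hypothetical names chosen for this example; the tool's actual JSON field names may differ.

```typescript
// Hypothetical record shape for one recognised word (field names are assumptions).
interface OcrWord {
  text: string;
  bbox: { x0: number; y0: number; x1: number; y1: number };
  confidence: number;
}

// Parse an hOCR word title such as "bbox 36 92 96 116; x_wconf 96".
// Returns null when either the bounding box or the confidence is missing.
function parseHocrTitle(
  title: string
): { bbox: OcrWord["bbox"]; confidence: number } | null {
  const bboxMatch = /bbox (\d+) (\d+) (\d+) (\d+)/.exec(title);
  const confMatch = /x_wconf (\d+(?:\.\d+)?)/.exec(title);
  if (!bboxMatch || !confMatch) return null;
  return {
    bbox: {
      x0: Number(bboxMatch[1]),
      y0: Number(bboxMatch[2]),
      x1: Number(bboxMatch[3]),
      y1: Number(bboxMatch[4]),
    },
    confidence: Number(confMatch[1]),
  };
}

// Combine the word text with its parsed title into one record.
function toWord(text: string, title: string): OcrWord | null {
  const parsed = parseHocrTitle(title);
  return parsed ? { text, ...parsed } : null;
}

const word = toWord("Invoice", "bbox 36 92 96 116; x_wconf 96");
console.log(JSON.stringify(word));
```

Coordinates in this record are still canvas pixels, measured from the top-left corner of the rendered page image.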

Can I process multi-language documents?

The current implementation supports one language per OCR run. For documents mixing two languages, choose the dominant language or run OCR twice with different language selections.

PDF OCR - Extract Text from Scanned PDFs

What is PDF OCR?

Optical Character Recognition (OCR) converts scanned or image-based PDF pages into machine-readable text. Unlike PDFs produced by word processors, scanned documents contain only embedded images with no underlying text layer. OCR reconstructs that text layer so you can search, copy, and edit it.

How the OCR engine works

This tool uses Tesseract.js, a WebAssembly port of Google's Tesseract OCR engine. Each PDF page is rendered to a high-resolution canvas (216 DPI equivalent) by PDF.js, then passed to Tesseract for recognition. Tesseract uses an LSTM neural network trained on printed text to identify characters and words. All processing runs in your browser - no page images are ever sent to a server.
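To make the 216 DPI figure concrete: PDF page sizes are measured in points, with 72 points per inch, so 216 DPI corresponds to a render scale of 3. The sketch below computes the canvas size for a given page under that assumption. The `renderScale` and `canvasSize` helpers are hypothetical; in PDF.js a scale like this would typically be passed to `page.getViewport({ scale })`.

```typescript
// PDF user space uses 72 points per inch.
const PDF_POINTS_PER_INCH = 72;

// Scale factor needed to render a page at the target DPI.
function renderScale(targetDpi: number): number {
  return targetDpi / PDF_POINTS_PER_INCH;
}

// Canvas pixel dimensions for a page whose size is given in PDF points.
function canvasSize(
  pageWidthPt: number,
  pageHeightPt: number,
  targetDpi: number
): { width: number; height: number } {
  const scale = renderScale(targetDpi);
  return {
    width: Math.round(pageWidthPt * scale),
    height: Math.round(pageHeightPt * scale),
  };
}

// US Letter is 612 x 792 points; at 216 DPI that is 1836 x 2376 pixels.
const letter = canvasSize(612, 792, 216);
console.log(letter.width, letter.height);
```

A larger canvas gives Tesseract more pixels per character, at the cost of more memory and longer recognition time per page.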

Choosing the right language

Selecting the correct document language is the single biggest factor in OCR accuracy. Tesseract loads a language-specific trained data file (~4 MB) from a CDN on first use. If your document contains a mix of languages, choose the dominant language. If your language is not in the common list, look for it in the full language list below the common options.

Understanding confidence scores

Each page receives a confidence score (0-100%) indicating how certain Tesseract was about its character recognition. Scores above 80% generally indicate clean, reliable output. Scores below 50% suggest the page may be blurry, contain unusual fonts, or be in a language that does not match your selection. Low-confidence pages are highlighted in amber or red.
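The thresholds above can be sketched as a small classifier. The 80% and 50% cut-offs come from the description; mapping the middle band to amber and the lowest band to red is an assumption made for this example, as are the `ConfidenceLevel` type and `classifyConfidence` function names.

```typescript
// Assumed mapping: above 80 is reliable, 50-80 is amber, below 50 is red.
type ConfidenceLevel = "good" | "amber" | "red";

function classifyConfidence(score: number): ConfidenceLevel {
  if (score > 80) return "good";
  if (score >= 50) return "amber";
  return "red";
}

// Classify a few example page scores.
for (const score of [92, 64, 31]) {
  console.log(score, classifyConfidence(score));
}
```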

Searchable PDF output

The Searchable PDF format embeds an invisible text layer at each recognised word's position in the original PDF. The visual appearance of the document is unchanged, but PDF viewers (Acrobat, Chrome, Preview) can now search, highlight, and copy text. Word positions are approximate - they are calculated by mapping OCR canvas pixel coordinates back to PDF point space. Minor misalignment is expected and does not affect searchability.
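The coordinate mapping can be sketched as follows. Canvas coordinates start at the top-left corner with y increasing downward, while PDF coordinates start at the bottom-left with y increasing upward, so the box is divided by the render scale and flipped vertically. This is a minimal sketch that assumes an unrotated page with its origin at (0, 0); the `canvasBoxToPdfRect` helper and its types are hypothetical, not the tool's actual code.

```typescript
// Word box in canvas pixels (top-left origin).
interface PixelBox { x0: number; y0: number; x1: number; y1: number }

// Rectangle in PDF points (bottom-left origin).
interface PdfRect { x: number; y: number; width: number; height: number }

// Map a canvas pixel box to PDF point space.
// Assumes no page rotation and no crop-box offset.
function canvasBoxToPdfRect(
  box: PixelBox,
  scale: number,
  pageHeightPt: number
): PdfRect {
  return {
    x: box.x0 / scale,
    // Flip the y axis: the bottom edge of the box becomes the PDF y origin.
    y: pageHeightPt - box.y1 / scale,
    width: (box.x1 - box.x0) / scale,
    height: (box.y1 - box.y0) / scale,
  };
}

// A word at pixels (300, 300)-(600, 360) on a 792 pt tall page rendered at scale 3.
const rect = canvasBoxToPdfRect({ x0: 300, y0: 300, x1: 600, y1: 360 }, 3, 792);
console.log(rect); // x: 100, y: 672, width: 100, height: 20
```

Rounding in the OCR bounding boxes and differences between the OCR font metrics and the embedded text font are typical sources of the small misalignment mentioned above.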

Markdown output

Markdown output uses font-size hints from the hOCR data to infer headings: large text becomes # H1 or ## H2 headings, and significant vertical gaps become paragraph breaks. This works well on cleanly structured documents like reports and articles, but complex multi-column layouts or decorative fonts may produce inaccurate heading classification. Always review Markdown output manually.
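One way such a heuristic could work is sketched below: treat the median `x_size` as the body text size, promote lines that are noticeably larger to headings, and start a new paragraph when the vertical gap between lines is large. The ratios (`H1_RATIO`, `H2_RATIO`), the gap factor, and the `OcrLine` shape are all assumptions picked for illustration, not the tool's actual values.

```typescript
// One OCR text line; xSize is the hOCR x_size hint, top/bottom are pixel rows.
interface OcrLine { text: string; xSize: number; top: number; bottom: number }

// Assumed thresholds, relative to the median (body) text size.
const H1_RATIO = 1.6;
const H2_RATIO = 1.25;
const PARAGRAPH_GAP_FACTOR = 0.8;

function median(values: number[]): number {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  const upper = sorted[mid] ?? 0;
  const lower = sorted[mid - 1] ?? upper;
  return sorted.length % 2 === 1 ? upper : (lower + upper) / 2;
}

function linesToMarkdown(lines: OcrLine[]): string {
  const bodySize = median(lines.map((l) => l.xSize));
  const blocks: string[] = [];
  let paragraph = "";
  let prevBottom: number | null = null;

  // Emit the paragraph collected so far, if any.
  const flush = (): void => {
    if (paragraph !== "") blocks.push(paragraph);
    paragraph = "";
  };

  for (const line of lines) {
    const ratio = bodySize > 0 ? line.xSize / bodySize : 1;
    const gap = prevBottom === null ? 0 : line.top - prevBottom;
    if (ratio >= H2_RATIO) {
      // Large text becomes a heading on its own block.
      flush();
      blocks.push(`${ratio >= H1_RATIO ? "#" : "##"} ${line.text}`);
    } else {
      // A large vertical gap starts a new paragraph.
      if (gap > PARAGRAPH_GAP_FACTOR * bodySize) flush();
      paragraph = paragraph === "" ? line.text : `${paragraph} ${line.text}`;
    }
    prevBottom = line.bottom;
  }
  flush();
  return blocks.join("\n\n");
}

const sample: OcrLine[] = [
  { text: "Annual Report", xSize: 40, top: 0, bottom: 40 },
  { text: "Revenue grew", xSize: 20, top: 60, bottom: 80 },
  { text: "in every region.", xSize: 20, top: 84, bottom: 104 },
  { text: "Costs were flat.", xSize: 20, top: 140, bottom: 160 },
];
console.log(linesToMarkdown(sample));
```

A heuristic like this has no notion of columns or reading order, which is why multi-column layouts and decorative fonts tend to produce misclassified headings.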

Privacy and offline use

Your PDF is never uploaded. The only network request is the one-time download of the Tesseract language data file from a public CDN. After that download, OCR runs entirely offline in your browser. No account is required, no data is stored, and no analytics are collected on your document content.