Dockitt

How to Make a Scanned PDF Searchable with OCR

When you scan a physical document, the result is a PDF that looks like text but is actually just an image of text. You cannot search it or copy text from it, and screen readers cannot read it aloud. This is where OCR (Optical Character Recognition) comes in. OCR analyses the image of each page, identifies the characters and words, and adds an invisible text layer aligned with the image. The result is a PDF that looks identical to the original scan but now behaves like a proper text document.

Dockitt uses ocrmypdf with the Tesseract engine, one of the most accurate open-source OCR systems available, to process your scanned PDFs.

How to run OCR on a scanned PDF online - step by step

  1. Open the Dockitt OCR PDF tool in your browser.
  2. Click 'Choose PDF' or drag and drop your scanned PDF into the upload area.
  3. Click 'Run OCR' and wait while Tesseract analyses each page and adds the text layer. This may take 30 seconds to a few minutes depending on the number of pages.
  4. Click 'Download' to save the OCR-processed PDF to your device.
  5. Open the file in any PDF viewer and press Ctrl+F on Windows or Cmd+F on Mac to search for text. If OCR worked correctly, the search should find words on the page.

What OCR does and does not do

Understanding exactly what OCR changes in your PDF helps set the right expectations for the result.

What affects OCR accuracy

OCR accuracy varies significantly depending on the quality of the original scan and the nature of the content. Here is what has the most impact.

OCR vs text-based PDFs

Not all PDFs need OCR. Understanding the difference between a scanned PDF and a text-based PDF helps you decide whether OCR is the right tool for your situation.

Using OCR before converting to Word

One of the most common reasons to run OCR is to prepare a scanned PDF for conversion to an editable Word document.

Processing large scanned documents

OCR is computationally intensive. Large scanned documents take significantly longer to process than small ones.

Common problems

The OCR result contains lots of errors and garbled text.

OCR accuracy depends heavily on scan quality. Low-resolution scans, pages with heavy shadows, skewed text, or faded ink produce poor results. If possible, re-scan the document at 300 DPI or higher with good lighting and straight pages. OCR errors affect search accuracy but do not change the visual appearance of the document.
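As a rough check against the 300 DPI guideline, the resolution of a scan can be estimated from the pixel width of the page image and the physical width of the paper. The sketch below assumes the image covers the full width of the page. The function names and the paper-size table are illustrative and are not part of Dockitt or ocrmypdf.

```python
# Rough estimate of scan resolution from pixel width and paper width.
# Assumption: the scanned image spans the full page width.
# These helpers are hypothetical, not a Dockitt or ocrmypdf API.

PAPER_WIDTH_INCHES = {
    "a4": 8.27,     # 210 mm
    "letter": 8.5,
}


def estimate_dpi(pixel_width: int, paper: str = "a4") -> float:
    """Return the approximate dots per inch of a scanned page."""
    return pixel_width / PAPER_WIDTH_INCHES[paper]


def meets_ocr_guideline(pixel_width: int, paper: str = "a4",
                        minimum_dpi: float = 300.0) -> bool:
    """True if the scan reaches the 300 DPI guideline."""
    # Round so a nominal 300 DPI scan is not rejected for a fractional shortfall.
    return round(estimate_dpi(pixel_width, paper)) >= minimum_dpi


if __name__ == "__main__":
    for width in (1240, 2480):
        print(width, round(estimate_dpi(width)), meets_ocr_guideline(width))
```

For example, an A4 page scanned to an image 1240 pixels wide works out to roughly 150 DPI, which is below the guideline, while 2480 pixels corresponds to roughly 300 DPI.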

The OCR process is very slow.

OCR is computationally intensive. A 50-page document may take several minutes to process. This is normal. If the tool times out on very large files, split the PDF into smaller sections and run OCR on each part separately, then merge the results.
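To plan the split, it can help to work out the page ranges in advance. The following is a minimal sketch that divides a document into consecutive ranges of a chosen size. The helper name and the chunk size of 50 pages are assumptions for illustration, not a limit documented by Dockitt.

```python
# Compute inclusive page ranges for splitting a large PDF before OCR.
# Hypothetical helper; the chunk size is the reader's choice.

def page_chunks(total_pages: int, chunk_size: int) -> list[tuple[int, int]]:
    """Split pages 1..total_pages into consecutive (first, last) ranges."""
    if total_pages < 0 or chunk_size < 1:
        raise ValueError("total_pages must be >= 0 and chunk_size >= 1")
    return [
        (start, min(start + chunk_size - 1, total_pages))
        for start in range(1, total_pages + 1, chunk_size)
    ]


if __name__ == "__main__":
    for first, last in page_chunks(120, 50):
        print(f"pages {first}-{last}")
```

For a 120-page document split into sections of at most 50 pages, this gives the ranges 1-50, 51-100, and 101-120. Each range can be entered in a split tool, processed with OCR, and merged back in the same order.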

After OCR, the file is much larger than the original.

Adding a text layer increases the file size, especially for high-resolution scans. After OCR processing, run the file through the Dockitt Compress PDF tool to reduce the size without losing the text layer.

The PDF already has text but OCR added a duplicate layer.

Only use the OCR tool on PDFs that are purely image-based scans where text selection and search do not work. If your PDF already has selectable text, running OCR again is unnecessary, and you can keep using the original file.

Search still does not work after OCR.

Try opening the processed file in a different PDF viewer such as Adobe Acrobat Reader. Some PDF viewers do not support text layer search. Also confirm that OCR actually ran by trying to select text on the page.

Related tools

PDF to Word: Convert an OCR-processed PDF to an editable Word document.
Compress PDF: Reduce file size after OCR processing.
Rotate PDF: Fix page orientation before running OCR.
Split PDF: Split large documents before OCR processing.

FAQ

What is OCR and how does it work?

OCR stands for Optical Character Recognition. It is a technology that analyses images containing text and identifies individual characters using pattern matching and machine learning models. Dockitt uses Tesseract, an open-source OCR engine originally developed by HP, later sponsored by Google, and now maintained as a community open-source project, combined with ocrmypdf to integrate the text layer cleanly into your PDF.

Will OCR change how my PDF looks?

No. The visual appearance of the PDF remains identical to the original scan. OCR adds an invisible text layer underneath the page images. The scanned images themselves are not altered. When you open the processed PDF it looks exactly the same as before, but now supports text search, copy-paste, and screen reader access.

Can OCR handle handwritten text?

Standard OCR engines including Tesseract are optimised for printed text and struggle significantly with handwriting, especially cursive. Handwriting recognition requires specialised models that are not part of this tool. For handwritten documents, results will be unreliable.

Does OCR work on PDFs with multiple columns or complex layouts?

Tesseract handles simple two-column layouts reasonably well. Complex magazine-style layouts, tables, or mixed text-and-image pages may produce text that is out of order in the text layer. The visual appearance remains correct; only the order of text in the hidden layer may be affected.

Is the OCR text layer used for anything besides searching?

Yes. The text layer enables copy-pasting text from the PDF, makes the document accessible to screen readers for visually impaired users, allows text extraction for further processing, and improves the document's indexability by search engines if it is published online.

How long does OCR processing take?

Processing time depends on the number of pages and the resolution of the scanned images. A 10-page document typically takes 30 to 60 seconds. A 50-page document may take several minutes. Keep the browser tab open while processing is in progress.
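The figures above imply roughly 3 to 6 seconds per page. As a back-of-the-envelope sketch, that rate can be scaled to other page counts. This assumes processing time grows linearly with page count and ignores scan resolution, so treat the result as a rough range rather than a guarantee. The constants and function name are illustrative only.

```python
# Back-of-the-envelope OCR time estimate.
# Assumption: 10 pages in 30 to 60 seconds, scaled linearly per page.

SECONDS_PER_PAGE_LOW = 3.0   # 30 s / 10 pages
SECONDS_PER_PAGE_HIGH = 6.0  # 60 s / 10 pages


def estimate_processing_seconds(pages: int) -> tuple[float, float]:
    """Return a (low, high) estimate in seconds for the given page count."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return pages * SECONDS_PER_PAGE_LOW, pages * SECONDS_PER_PAGE_HIGH


if __name__ == "__main__":
    low, high = estimate_processing_seconds(50)
    print(f"50 pages: about {low / 60:.1f} to {high / 60:.1f} minutes")
```

For 50 pages this gives 150 to 300 seconds, or about 2.5 to 5 minutes, which is consistent with the "several minutes" figure.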

Can I run OCR on a PDF that already has some text pages and some scanned pages?

Yes. ocrmypdf can skip pages that already have a text layer and process only the image-based pages, so running OCR on a mixed PDF adds text layers only to the pages that need them.
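For readers who run ocrmypdf on their own machine, skipping pages that already contain text corresponds to its `--skip-text` option. The sketch below only builds the command line. It does not reflect Dockitt's actual configuration, which is not documented here, and the helper name, file names, and defaults are assumptions.

```python
# Build an ocrmypdf command line for a PDF that mixes text and scanned pages.
# Hypothetical helper; Dockitt's real settings are not known.
import shlex


def build_ocrmypdf_command(input_pdf: str, output_pdf: str,
                           language: str = "eng",
                           skip_text: bool = True,
                           deskew: bool = False) -> list[str]:
    """Return the argument list for an ocrmypdf run."""
    command = ["ocrmypdf", "--language", language]
    if skip_text:
        # Leave pages that already have a text layer untouched.
        command.append("--skip-text")
    if deskew:
        # Straighten skewed page images before recognition.
        command.append("--deskew")
    command += [input_pdf, output_pdf]
    return command


if __name__ == "__main__":
    # Print the command; pass the list to subprocess.run() to execute it.
    print(shlex.join(build_ocrmypdf_command("scan.pdf", "searchable.pdf")))
```

Running the command requires ocrmypdf and Tesseract to be installed locally. The list form can be passed directly to `subprocess.run`.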

Try it now

Ready to make your scanned PDF searchable? Use the free Dockitt OCR tool below.

OCR PDF: Add a searchable text layer to scanned PDFs.