This is not a new problem, but I thought it would have been easier with open source software, so I’m documenting my solution. With some research and experimentation, I adapted this script into something that takes a collection of images of text (e.g. pages from a book or a paper) and converts them into a PDF you can search. You will need to install some other packages. My instructions here assume you’re using Homebrew on a Mac, but the script should be adaptable to any platform that can run Tesseract, ImageMagick, and Ghostscript.
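For reference, here is a minimal sketch of the general approach, not the exact script. It assumes the three tools are already installed (on a Mac with Homebrew, `brew install tesseract imagemagick ghostscript`) and that your page images sort into reading order by filename. The function names `page_base` and `images_to_searchable_pdf` are my own placeholders, and any ImageMagick cleanup of the images is left out.

```shell
# Sketch only: OCR each page image into a one-page searchable PDF,
# then merge the pages with Ghostscript. Assumes tesseract and gs are on PATH.

# Hypothetical helper: build a zero-padded output base name for page N
# so that the per-page PDFs sort in the order the images were given.
page_base() {
  # $1 = output directory, $2 = page number
  printf '%s/page-%04d\n' "$1" "$2"
}

# Hypothetical helper: images_to_searchable_pdf OUTPUT.pdf IMAGE...
# The body runs in a subshell, so its variables do not leak to the caller.
images_to_searchable_pdf() (
  out="$1"; shift
  tmp="$(mktemp -d)" || exit 1
  i=0
  for img in "$@"; do
    i=$((i + 1))
    base="$(page_base "$tmp" "$i")"
    # The trailing "pdf" asks tesseract for PDF output: the page image
    # with an invisible text layer on top. It writes "$base.pdf".
    tesseract "$img" "$base" -l eng pdf || { rm -rf "$tmp"; exit 1; }
  done
  # Merge the per-page PDFs; the glob expands in sorted (page) order.
  gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -sOutputFile="$out" "$tmp"/page-*.pdf
  status=$?
  rm -rf "$tmp"
  exit "$status"
)
```

Something like `images_to_searchable_pdf book.pdf scans/*.png` would then produce `book.pdf`. The `-l eng` language setting is an assumption; change it to match your text.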

I will say that it’s WAY slower than the built-in OCR functionality on some scanner/printers I’ve seen. I’m not sure why. And FWIW, the process of editing and cropping scanned pages still takes a lot of time.

You can also use Ghostscript to add author/title metadata to the finished PDF.
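One way to do this is with Ghostscript’s `pdfmark` operator: write a small pdfmark file containing the title and author, then pass it to `gs` after the input PDF. The sketch below is an assumption about how you might wrap that, with placeholder function names (`write_pdfmarks`, `add_pdf_metadata`). It does not escape special characters, so a title containing unbalanced parentheses or backslashes would need them backslash-escaped first.

```shell
# Sketch only: set Title/Author on a PDF using a pdfmark file.

# Hypothetical helper: write the pdfmark file.
write_pdfmarks() {
  # $1 = title, $2 = author, $3 = pdfmark file to create
  printf '[ /Title (%s)\n  /Author (%s)\n  /DOCINFO pdfmark\n' "$1" "$2" > "$3"
}

# Hypothetical helper: add_pdf_metadata IN.pdf OUT.pdf TITLE AUTHOR
# Runs in a subshell so its variables stay local.
add_pdf_metadata() (
  in="$1"; out="$2"; title="$3"; author="$4"
  marks="$(mktemp)" || exit 1
  write_pdfmarks "$title" "$author" "$marks"
  # Ghostscript rewrites the PDF and applies the DOCINFO pdfmark,
  # which is read after the input file.
  gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite -sOutputFile="$out" "$in" "$marks"
  status=$?
  rm -f "$marks"
  exit "$status"
)
```

For example, `add_pdf_metadata book.pdf book-tagged.pdf "My Book" "A. Author"` writes a new file rather than modifying the original in place.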
