This time of year is a good one for cleanup chores. Today is “get rid of old paperwork day”, so I decided to try to go paperless. One thing I learned today is how to use Tesseract and Ghostscript to create searchable PDFs from scans.

Scanning documents is the easy part. The problem then is finding them. Of course, it’s useful to manually sort them, but I wanted to have searchable PDFs that I could index on the family server.

Using Tesseract to create searchable PDFs

I dug around a little on the Internet and found that Tesseract could do exactly what I wanted. Below is a short script, which I called png2pdf, that runs OCR on several scanned images and either combines them into a single searchable PDF or produces one searchable PDF per image.

#!/bin/bash

name="$1"
shift

echo "Turning $* into ${name}.pdf"
mkdir -p processed "$(dirname "$name")"

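# OCR each page in parallel; tesseract writes a one-page PDF next to each image
pdfs=()
for f in "$@"; do
  # French and English, automatic page segmentation (mode 3).
  # Older Tesseract 3.x releases spell this option -psm.
  tesseract -l fra+eng --psm 3 "$f" "${f%.*}" pdf &
  pdfs+=("${f%.*}.pdf")
done

# Wait for the tesseract processes to complete
wait

# combine all pages back to a single file
if [ "$name" != "-" ]; then
    # Use the recorded PDF names: rewriting "$@" with ${@/.*/.pdf} breaks on paths that contain another dot
    gs -dCompatibilityLevel=1.4 -dNOPAUSE -dBATCH -q -sDEVICE=pdfwrite -sOutputFile="${name}.pdf" "${pdfs[@]}"
    mv "${pdfs[@]}" processed/
fi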

mv "$@" processed/
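
The script relies on shell parameter expansion to work out the name of the PDF that tesseract writes for each image. Here is a minimal sketch of that step on its own. The helper pdf_name_for is hypothetical and not part of png2pdf, and it assumes every image name has an extension.

```shell
#!/bin/bash
# Hypothetical helper: map an image path to the PDF path tesseract writes
# when it is given "${image%.*}" as its output base.
pdf_name_for() {
  local image="$1"
  # %.* strips the shortest suffix that starts with a dot, i.e. the extension.
  printf '%s\n' "${image%.*}.pdf"
}

pdf_name_for "Image1.jpg"               # Image1.pdf
pdf_name_for "scans/2016.12/page1.jpg"  # scans/2016.12/page1.pdf

# For comparison, the substitution form replaces everything from the FIRST dot.
path="scans/2016.12/page1.jpg"
printf '%s\n' "${path/.*/.pdf}"         # scans/2016.pdf
```

The two forms agree for a plain name such as Image1.jpg, and only differ when a directory or file name contains an extra dot.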

Usage model

There are two ways to use this script:

  1. Combining multiple image files, e.g. JPEG, into a single PDF document. For example, png2pdf TargetName Image1.jpg Image2.jpg Image3.jpg produces TargetName.pdf.
  2. Generating one single-page PDF file per image when the first argument is -. For example, png2pdf - Image1.jpg Image2.jpg Image3.jpg produces Image1.pdf, Image2.pdf and Image3.pdf.

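The script moves the source images into the processed/ directory once they have been converted. When pages are combined into a single document, the intermediate one-page PDFs are moved there as well. You can delete that directory later if you want.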
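
To check that the generated PDFs really are searchable, one option is to extract their text layer and grep it. The sketch below is an assumption on my part, not something png2pdf does: it relies on pdftotext from poppler-utils being installed, and find_in_pdfs is a hypothetical helper name. It only looks at the PDFs directly inside the given directory.

```shell
#!/bin/bash
# Hypothetical helper: print the PDFs in a directory whose text layer
# contains a given term (case-insensitive, fixed string).
# Assumes pdftotext (poppler-utils) is available on the PATH.
find_in_pdfs() {
  local term="$1" dir="${2:-.}"
  local pdf
  for pdf in "$dir"/*.pdf; do
    # Skip the unexpanded pattern when the directory holds no PDF.
    [ -e "$pdf" ] || continue
    # "pdftotext file -" writes the extracted text to standard output.
    if pdftotext "$pdf" - 2>/dev/null | grep -qiF -- "$term"; then
      printf '%s\n' "$pdf"
    fi
  done
}
```

For instance, find_in_pdfs facture . lists the PDFs in the current directory that contain the word "facture". A PDF whose text layer is missing simply never shows up in the results.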
