Document scanning and "searchable" PDFs

David Vasconcelos
Sat May 9 15:34:57 EDT 2009


I've got an HP OfficeJet 5610 and I'm interested in finding a
replacement for the bundled scanner software.

The HP software (Windows) can produce something it calls a "searchable
PDF."  I really like this format because it combines an image of the
document with OCR'd text.

The text gets embedded in such a way that you can select/copy text
directly from acroread, evince, etc.

I've tried gscan2pdf and it comes pretty close to what I'm looking for.
However...

1. The OCR'd text gets embedded differently, so you can't actually
select/copy the OCR'd text from a PDF viewer.

2. The OCR back-ends for gscan2pdf (tesseract and GOCR) seem to have
trouble with multiple columns of text, or things like pay-stubs where
the text doesn't flow in paragraphs.  The free HP software seems to
handle this without a problem.

So, I've been scanning from Windows.  I'd really like to find an
alternative.

Any suggestions?  Thanks!

David




