Document scanning and "searchable" PDFs
Greg Rundlett (freephile)
greg-SfI3QVg0eaJl57MIdRCFDg at public.gmane.org
Thu May 14 09:54:49 EDT 2009
My first reply went off-list.
On Sat, May 9, 2009 at 3:34 PM, David Vasconcelos <
david-D010uk0En6A+WRpjb9m9WFaTQe2KTcn/@public.gmane.org> wrote:
> I've got an HP OfficeJet 5610 and I'm interested in finding a
> replacement for the bundled scanner software.
>
> The HP software (Windows) can produce something it calls a "searchable
> PDF." I really like this format because it combines an image of the
> document with OCR'd text.
>
> The text gets embedded in such a way that you can select/copy text
> directly from acroread, evince, etc.
>
> I've tried gscan2pdf and it comes pretty close to what I'm looking for.
> However...
>
> 1. The OCR'd text gets embedded differently, so you can't actually
> select/copy the OCR'd text from a PDF viewer.
>
> 2. The OCR back-ends for gscan2pdf (Tesseract and GOCR) seem to have
> trouble with multiple columns of text, or things like pay-stubs where
> the text doesn't flow in paragraphs. The free HP software seems to
> handle this without a problem.
>
> So, I've been scanning from Windows. I'd really like to find an
> alternative.
>
> Any suggestions? Thanks!
>
I've only used XSane or Kooka (http://kooka.kde.org/) with the normal OCR
engines, and it has been a long time since I scanned anything.
After reading this review (http://groundstate.ca/ocr), I learned about
OCRopus, which looks very interesting:
http://code.google.com/p/ocropus/
http://sites.google.com/site/ocropus/install-mercurial
This review http://www.linux.com/archive/articles/57222 explains how
Tesseract (http://code.google.com/p/tesseract-ocr/), originally from HP and
now sponsored by Google, changed the landscape by providing high accuracy,
but I think it's either incorporated into, or superseded by, OCRopus.
There are commercial applications that can be run on Linux/Unix, but the
cost is in the thousands of dollars:
http://vividata.com/be_xtr_pricing.html
Please let us know what else you find out.
~ Greg
--
Greg Rundlett
Web Developer - Initiative in Innovative Computing
http://iic.harvard.edu
camb 617-384-5872
nbpt 978-225-8302
m. 978-764-4424
-skype/aim/irc/twitter freephile
http://profiles.aim.com/freephile