first reply went off-list

On Sat, May 9, 2009 at 3:34 PM, David Vasconcelos <david-D010uk0En6A+WRpjb9m9WFaTQe2KTcn/@public.gmane.org> wrote:

> I've got an HP OfficeJet 5610 and I'm interested in finding a
> replacement for the bundled scanner software.
>
> The HP software (Windows) can produce something it calls a "searchable
> PDF." I really like this format because it combines an image of the
> document with OCR'd text.
>
> The text gets embedded in such a way that you can select/copy text
> directly from acroread, evince, etc.
>
> I've tried gscan2pdf and it comes pretty close to what I'm looking for.
> However...
>
> 1. The OCR'd text gets embedded differently, so you can't actually
> select/copy the OCR'd text from a PDF viewer.
>
> 2. The OCR back-ends for gscan2pdf (Tesseract and GOCR) seem to have
> trouble with multiple columns of text, or things like pay-stubs where
> the text doesn't flow in paragraphs. The free HP software seems to
> handle this without a problem.
>
> So, I've been scanning from Windows. I'd really like to find an
> alternative.
>
> Any suggestions? Thanks!
>

I've only used XSane or Kooka (http://kooka.kde.org/) with the normal OCR engines, and it has been a long time since I scanned anything.

After I read this review, http://groundstate.ca/ocr, I learned about OCRopus. It looks very interesting:

http://code.google.com/p/ocropus/
http://sites.google.com/site/ocropus/install-mercurial

This review, http://www.linux.com/archive/articles/57222, explains how Tesseract (http://code.google.com/p/tesseract-ocr/), from HP and now Google, changed the landscape and provided high accuracy, but I think it's either incorporated into, or superseded by, OCRopus.

There are commercial applications that can be run on Linux/Unix, but the cost is in the thousands of dollars: http://vividata.com/be_xtr_pricing.html

Please let us know what else you find out.

~ Greg

--
Greg Rundlett
Web Developer - Initiative in Innovative Computing
http://iic.harvard.edu
camb 617-384-5872
nbpt 978-225-8302
m. 978-764-4424
-skype/aim/irc/twitter freephile
http://profiles.aim.com/freephile

BLU is a member of BostonUserGroups.
We also thank MIT for the use of their facilities.