David Kramer wrote:
> James R. Van Zandt wrote:
>> I have put together a sizable collection of IEEE papers, but they're
>> image-only PDFs, making them hard to search.
>>
>> Is there a convenient way to add the metadata to the PDF files
>> themselves, along with (say) a hand-typed abstract and OCR of the
>> rest, so the whole thing can be indexed by something like beagle
>> <http://beaglewiki.org/Main_Page>?
>>
>> - Jim Van Zandt
>
> I would start by running pdftotext on them, then using regular
> expressions to pull metadata out of the text versions.
>
> Oddly enough, this is the basis of one of the projects I'm working on at
> Aptima. Pulling metadata from information coming from many sources in
> many formats, tracking the metadata, and grouping documents into that
> metadata.

These are image-only PDFs; each page of the PDF is simply a big image.
pdftotext won't find any text in them.

I haven't found a good OCR solution for Linux. I have an adequate OCR
package for Windows, but I don't see any way to automate it; each
document has to be processed by hand. And the results are somewhat
adequate as metadata, but you'd still need to review and correct
non-trivial amounts of the resulting text in order to achieve decent
searchability.

--
John Abreau
IT Manager
Zuken USA
238 Littleton Rd., Suite 100
Westford, MA 01886
T: 978-392-1777
F: 978-692-4725
M: 978-764-8934
E: John.Abreau at zuken.com
W: www.zuken.com
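
To illustrate the point about image-only PDFs: one rough way to sort a collection into files that already carry a text layer and files that are just page images is to run pdftotext on each one and see whether anything meaningful comes back. The sketch below assumes pdftotext (from poppler-utils or xpdf) is installed and on the PATH. The function names and the 20-character threshold are invented for illustration and are not part of any tool mentioned in this thread.

```python
import subprocess
import sys


def looks_image_only(extracted_text, min_chars=20):
    """Return True if pdftotext output is too short to be a real text layer.

    min_chars is an arbitrary, hypothetical threshold; tune it for your files.
    """
    # Drop all whitespace, including the form feeds pdftotext emits between pages.
    visible = "".join(extracted_text.split())
    return len(visible) < min_chars


def extract_text(pdf_path):
    """Run pdftotext and return whatever text it finds (possibly empty)."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],  # "-" sends the text to stdout
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    for path in sys.argv[1:]:
        if looks_image_only(extract_text(path)):
            print(f"{path}: image-only, needs OCR")
        else:
            print(f"{path}: has a text layer")
```

Saved under a name of your choosing and run with a list of PDF paths as arguments, this only flags which files would need OCR; it does not do any OCR itself, and the regular-expression approach would apply only to the files reported as having a text layer.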