Fw: What laser printers do you like - Ricoh & Linux

David Kramer david at thekramers.net
Sat Jul 15 10:42:23 EDT 2006


James R. Van Zandt wrote:
> I have put together a sizable collection of IEEE papers, but they're
> image-only PDFs, making them hard to search.
> 
> Is there a convenient way to add the metadata to the PDF files
> themselves, along with (say) a hand-typed abstract and OCR of the
> rest, so the whole thing can be indexed by something like beagle
> <http://beaglewiki.org/Main_Page>?  
> 
>               - Jim Van Zandt

I would start by running pdftotext on them (after an OCR pass, since
pdftotext can only extract a text layer and image-only PDFs have none),
then using regular expressions to pull metadata out of the text versions.
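
A rough sketch of that pipeline in Python, in case it helps. It assumes
the pdftotext command-line tool is installed and that each PDF already
carries a text layer. The function names and the regular expressions are
made up for illustration; real papers vary a lot in layout, so treat the
patterns as a starting point rather than something that will work as-is.

```python
import re
import subprocess

# Hypothetical patterns: an "Abstract:" or "Abstract-" paragraph that runs
# until the next blank line, and the first four-digit year in the text.
ABSTRACT_RE = re.compile(
    r"^\s*Abstract\s*[-:]\s*(.+?)(?=\n\s*\n|\Z)",
    re.S | re.I | re.M,
)
YEAR_RE = re.compile(r"\b(?:19|20)\d{2}\b")


def pdf_to_text(path):
    """Return the text layer of a PDF by calling pdftotext.

    The trailing "-" asks pdftotext to write to stdout instead of a file.
    """
    result = subprocess.run(
        ["pdftotext", "-layout", path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def extract_metadata(text):
    """Pull a guessed title, abstract, and year out of plain text."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    # Assumption: the first non-blank line is the title.
    meta = {"title": lines[0] if lines else None}

    match = ABSTRACT_RE.search(text)
    # Collapse line breaks inside the abstract into single spaces.
    meta["abstract"] = " ".join(match.group(1).split()) if match else None

    year = YEAR_RE.search(text)
    meta["year"] = year.group(0) if year else None
    return meta
```

Usage would be something like extract_metadata(pdf_to_text("paper.pdf")),
with the resulting dictionary written wherever the indexer can find it.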

Oddly enough, this is the basis of one of the projects I'm working on at
Aptima: pulling metadata from information coming from many sources in
many formats, tracking the metadata, and grouping documents by that
metadata.


