Fw: What laser printers do you like - Ricoh & Linux
John Abreau
john.abreau at zuken.com
Mon Jul 17 10:59:42 EDT 2006
David Kramer wrote:
> James R. Van Zandt wrote:
>> I have put together a sizable collection of IEEE papers, but they're
>> image-only PDFs, making them hard to search.
>>
>> Is there a convenient way to add the metadata to the PDF files
>> themselves, along with (say) a hand-typed abstract and OCR of the
>> rest, so the whole thing can be indexed by something like beagle
>> <http://beaglewiki.org/Main_Page>?
>>
>> - Jim Van Zandt
>
> I would start by running pdftotext on them, then using regular
> expressions to pull metadata out of the text versions.
>
> Oddly enough, this is the basis of one of the projects I'm working on at
> Aptima: pulling metadata from information coming from many sources in
> many formats, tracking the metadata, and grouping documents by that
> metadata.

These are image-only PDFs: each page of the PDF is simply one big image,
so pdftotext won't find any text in them.
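
One way to confirm which files in a collection are image-only is to run
pdftotext on each and see whether anything besides whitespace comes back.
The following is only a rough sketch: the function names and the 20-character
threshold are my own assumptions, and it assumes the pdftotext binary is on
the PATH.

```python
import subprocess


def has_text_layer(extracted: str, min_chars: int = 20) -> bool:
    """Return True if pdftotext output holds a meaningful amount of text.

    min_chars is an arbitrary, assumed threshold; tune it for your files.
    """
    # Image-only pages typically yield nothing but whitespace and form feeds,
    # so drop all whitespace before counting characters.
    visible = "".join(extracted.split())
    return len(visible) >= min_chars


def pdf_has_text(path: str) -> bool:
    """Run pdftotext on one file and report whether it found any text."""
    # Passing "-" as the output file makes pdftotext write to stdout.
    result = subprocess.run(
        ["pdftotext", path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return has_text_layer(result.stdout)
```

Files for which pdf_has_text() returns False are the ones that would need OCR
before any text-based metadata extraction can work.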

I haven't found a good OCR solution for Linux. I have an adequate OCR
package for Windows, but I don't see any way to automate it; each
document has to be processed by hand. The results are usable as rough
metadata, but you'd still need to review and correct a non-trivial
amount of the resulting text to get decent searchability.
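
Once reviewed text does exist for a paper, the regular-expression approach
David suggested might look roughly like the sketch below. The field names
and patterns are hypothetical: they assume the text contains an "Abstract"
paragraph and an "Index Terms" or "Keywords" paragraph, each ended by a
blank line, which will not hold for every paper.

```python
import re

# Hypothetical patterns; real layouts vary, so expect to tune these.
PATTERNS = {
    "abstract": re.compile(
        r"Abstract\s*[-:]?\s*(.+?)(?:\n\s*\n|\Z)", re.S | re.I
    ),
    "keywords": re.compile(
        r"(?:Index Terms|Keywords)\s*[-:]?\s*(.+?)(?:\n\s*\n|\Z)", re.S | re.I
    ),
}


def extract_metadata(text: str) -> dict:
    """Pull whichever assumed fields can be found out of plain text."""
    found = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            # Collapse line breaks left over from the page layout.
            found[field] = " ".join(match.group(1).split())
    return found
```

Fields that are not found are simply left out of the returned dict, so the
caller can tell which documents still need a hand-typed abstract.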
--
John Abreau
IT Manager
Zuken USA
238 Littleton Rd., Suite 100
Westford, MA 01886
T: 978-392-1777 F: 978-692-4725
M: 978-764-8934
E: John.Abreau at zuken.com W: www.zuken.com