BLU Discuss list archive

Fw: What laser printers do you like - Ricoh & Linux

Subject: Fw: What laser printers do you like - Ricoh & Linux
From: john.abreau at zuken.com (John Abreau)
Date: Mon, 17 Jul 2006 10:59:42 -0400
In-reply-to: <44B8FECF.9000402@thekramers.net>
References: <001e01c6a6cc$ac3fcf40$0600a8c0@SAVIN.RFG.COM> <44B7AFE7.5050206@zuken.com> <E1G1lA6-0006kg-00@vanzandt.comcast.net> <44B8FECF.9000402@thekramers.net>

David Kramer wrote:
> James R. Van Zandt wrote:
>> I have put together a sizable collection of IEEE papers, but they're
>> image-only PDFs, making them hard to search.
>>
>> Is there a convenient way to add the metadata to the PDF files
>> themselves, along with (say) a hand-typed abstract and OCR of the
>> rest, so the whole thing can be indexed by something like beagle
>> <http://beaglewiki.org/Main_Page>?  
>>
>>               - Jim Van Zandt
> 
> I would start by running pdftotext on them, then using regular
> expressions to pull metadata out of the text versions.
> 
> Oddly enough, this is the basis of one of the projects I'm working on at
> Aptima: pulling metadata from information coming from many sources in
> many formats, tracking the metadata, and grouping documents by that
> metadata.

These are image-only PDFs; each page of the PDF is simply a big image.
pdftotext won't find any text in them.
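
A rough way to confirm which files are affected is to run pdftotext on
each one and see whether anything other than whitespace comes back. The
sketch below assumes pdftotext (from xpdf or poppler) is on the PATH; the
helper names and the 20-character threshold are my own choices for
illustration, not part of any tool.

```python
import subprocess


def has_text_layer(extracted: str, min_chars: int = 20) -> bool:
    """Return True if the text has at least min_chars non-whitespace
    characters. The threshold is an arbitrary assumption."""
    # str.split() with no arguments also drops the form feeds that
    # pdftotext emits between pages.
    visible = "".join(extracted.split())
    return len(visible) >= min_chars


def extract_text(pdf_path: str) -> str:
    """Run `pdftotext FILE -` so the text goes to stdout, not a file."""
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        encoding="utf-8",
        errors="replace",
        check=True,
    )
    return result.stdout


def is_image_only(pdf_path: str) -> bool:
    """Guess that a PDF is image-only when pdftotext finds almost nothing."""
    return not has_text_layer(extract_text(pdf_path))
```

Calling is_image_only() on each file in the collection would give a list
of the ones that need OCR before any text-based indexing can work.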

I haven't found a good OCR solution for Linux. I have an adequate OCR
package for Windows, but I don't see any way to automate it; each
document has to be processed by hand. The results are good enough to
serve as rough metadata, but you'd still need to review and correct a
non-trivial amount of the text to get decent searchability.
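
For files that do yield usable text (or once OCR output has been
corrected), David's regular-expression idea might look something like the
sketch below. It assumes the papers label their sections "Abstract" and
"Index Terms" or "Keywords", followed by a dash or colon; that layout, the
function name, and the field names are assumptions on my part, and real
papers will need the patterns adjusted.

```python
import re

# Assumed layout: a label followed by a hyphen, colon, en dash (\u2013)
# or em dash (\u2014), e.g. "Abstract-" or "Index Terms:".
_SEP = r"\s*[-:\u2013\u2014]\s*"

# Capture lazily up to the next blank line, the keywords label, or the end.
ABSTRACT_RE = re.compile(
    r"Abstract" + _SEP + r"(.+?)(?=\n\s*\n|Index Terms|Keywords|\Z)",
    re.IGNORECASE | re.DOTALL,
)
KEYWORDS_RE = re.compile(
    r"(?:Index Terms|Keywords)" + _SEP + r"(.+?)(?=\n\s*\n|\Z)",
    re.IGNORECASE | re.DOTALL,
)


def extract_metadata(text: str) -> dict:
    """Pull an abstract and keyword list out of plain text, if present."""
    meta = {}
    m = ABSTRACT_RE.search(text)
    if m:
        # Collapse line breaks and runs of spaces into single spaces.
        meta["abstract"] = " ".join(m.group(1).split())
    m = KEYWORDS_RE.search(text)
    if m:
        terms = " ".join(m.group(1).split()).rstrip(".")
        meta["keywords"] = [
            t.strip() for t in re.split(r"[,;]", terms) if t.strip()
        ]
    return meta
```

Fields that are not found are simply left out of the result, so a
hand-typed abstract can be filled in afterwards for those documents.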


-- 
John Abreau
IT Manager
Zuken USA
238 Littleton Rd., Suite 100
Westford, MA 01886
T: 978-392-1777            F: 978-692-4725
M: 978-764-8934
E: John.Abreau at zuken.com  W: www.zuken.com

