Good Word doc -> plain text conversion

David Kramer david-8uUts6sDVDvs2Lz0fTdYFQ at public.gmane.org
Sun Sep 19 13:02:00 EDT 2010


On 09/19/2010 03:38 PM, jc-8FIgwK2HfyJMuWfdjsoA/w at public.gmane.org wrote:
> Anyone here have advice on programs (scriptable and usable
> on linux) that convert Word docs to plain text?
>
> I've been googling, of course, but most of the things I'm
> finding start with "1. Load the file into Word". This is a
> good clue that the scheme probably can't be used in a
> script that's running on a linux system.  ;-)

If you want an automated solution, how about writing it in Java?

http://poi.apache.org/
The Apache POI Project's mission is to create and maintain Java APIs for
manipulating various file formats based upon the Office Open XML
standards (OOXML) and Microsoft's OLE 2 Compound Document format (OLE2).
In short, you can read and write MS Excel files using Java. In addition,
you can read and write MS Word and MS PowerPoint files using Java.
Apache POI is your Java Excel solution (for Excel 97-2008). We have a
complete API for porting other OOXML and OLE2 formats and welcome others
to participate.

OLE2 files include most Microsoft Office files such as XLS, DOC, and PPT
as well as MFC serialization API based file formats. The project
provides APIs for the OLE2 Filesystem (POIFS) and OLE2 Document
Properties (HPSF).
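
To give a rough idea of what such a conversion involves for the OOXML
(.docx) case, here is a minimal sketch that uses only JDK classes, not
POI. It relies on the fact that a .docx file is a ZIP archive whose main
text lives in word/document.xml, inside w:t elements grouped into w:p
paragraphs. The class and method names (DocxToText, extract, walk) are
made up for illustration, Java 9 or later is assumed, and the sketch
ignores headers, footnotes, tabs, and the older OLE2 .doc format, which
is where a library like POI would be needed.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

// Hypothetical helper: pulls the body text out of a .docx file.
public class DocxToText {
    // Namespace of the WordprocessingML "w:" elements.
    private static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    public static String extract(InputStream docx) throws Exception {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (!entry.getName().equals("word/document.xml")) {
                    continue;
                }
                DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
                factory.setNamespaceAware(true);
                // Refuse DOCTYPE declarations so untrusted files cannot pull in entities.
                factory.setFeature(
                    "http://apache.org/xml/features/disallow-doctype-decl", true);
                // readAllBytes() stops at the end of the current ZIP entry.
                Document xml = factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(zip.readAllBytes()));
                StringBuilder out = new StringBuilder();
                walk(xml, out);
                return out.toString();
            }
        }
        throw new IOException("word/document.xml not found; is this a .docx file?");
    }

    // Depth-first walk: append each w:t text run, and a newline after each w:p.
    private static void walk(Node node, StringBuilder out) {
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() != Node.ELEMENT_NODE) {
                continue;
            }
            boolean inWordNs = W_NS.equals(c.getNamespaceURI());
            if (inWordNs && "t".equals(c.getLocalName())) {
                out.append(c.getTextContent());
            } else {
                walk(c, out);
                if (inWordNs && "p".equals(c.getLocalName())) {
                    out.append('\n');
                }
            }
        }
    }

    public static void main(String[] args) throws Exception {
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.print(extract(in));
        }
    }
}
```

From a script this could be run as "java DocxToText.java report.docx >
report.txt" (Java 11 or later; report.docx is a placeholder file name).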

Here are some other solutions:
http://www.linux.com/archive/feed/52385




