Good Word doc -> plain text conversion

jc-8FIgwK2HfyJMuWfdjsoA/w at public.gmane.org
Mon Sep 20 00:01:25 EDT 2010


Dan Ritter wrote:
| antiword is the usual candidate. Every one of Google's first ten
| results for it is relevant.

Yeah, I thought of that, too, but I was hoping there might be something that
does a better job. In one of my current sample .doc files, for example,
antiword produces the curious table entry:

| CUTRIGHT, Susan        |11 Arlington Road    | (781)209-9877          |
|                        |Waltham, MA  02453   |susan.cutright at ASPENTECH|
|                        |                     |.com                    |

Note the "wrapping" of the email address, with the ".com" on a separate line.
When Word displays this on a Windows screen, this wrapping doesn't happen.
The 3rd column strings are actually centered, and the email address is
whole.

After a bit of exploring, I found that the -w option gives a wider "page"
size. With it, this entry comes out intact, but others in the file don't.
When I tried things like "antiword -w 200 <file>", it reduced the width
to 138, which seems to be the widest "page" that it believes possible. So
later in the same file, I get the following 138-char-wide chunk:

|SNIDERMAN, Rebecca                            |MB 1794 Brandeis University P O Box      |             rsnider-1FONPbNgvBv2fBVCVOL8/A at public.gmane.org              |
|                                              |549110                                   |                                               |
|                                              |Brandeis University                      |                                               |
|                                              |Waltham, MA  02454-9110                  |                                               |

Note the bizarre 4-line address, with just "549110" on the second line. Of
course, the sensible thing would be to remove the first "Brandeis University"
from the address, but that's what's in the file, and there are other entries
with quite long addresses. I tried to write a perl parser that would handle
all the entries in this file and a couple of others, and after an afternoon
of hacking at it, I still haven't quite succeeded. Such spurious line
wrapping, including things like splitting ".net" into ".n" and "et" in one
case, can be one of the trickier kinds of damage to fix.
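
For what it's worth, here is a rough sketch of one way such a cleanup pass
might look, in Python rather than perl. It is not the parser described above,
only an illustration under some loud assumptions: every table row starts and
ends with "|", a row whose first cell is empty continues the previous entry,
and a fragment made only of letters, digits, dots and hyphens that follows an
unfinished-looking email address is the rest of that address. The function
names and the sample entries (example addresses, not the real ones) are made
up for this sketch.

```python
import re

# Hypothetical sample shaped like the antiword output above (made-up entries).
SAMPLE = """\
|DOE, Jane               |11 Example Road      | (555)010-0000          |
|                        |Anytown, MA  00000   |jane.doe@EXAMPLE        |
|                        |                     |.com                    |
|ROE, Richard            |PO Box 1             |rroe@example.n          |
|                        |Anytown, MA  00000   |et                      |
"""

# Assumption: the wrapped tail of an email address contains only these chars.
EMAIL_TAIL = re.compile(r"[A-Za-z0-9.-]+")


def parse_rows(text):
    """Split '|'-delimited lines into lists of stripped cells."""
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not (line.startswith("|") and line.endswith("|")):
            continue  # skip anything that is not a table row
        rows.append([cell.strip() for cell in line[1:-1].split("|")])
    return rows


def merge_continuations(rows):
    """Fold rows whose first cell is empty into the preceding entry."""
    records = []
    for cells in rows:
        if cells[0] or not records:
            # Non-empty first cell: start a new entry.
            records.append([[c] if c else [] for c in cells])
        else:
            # Empty first cell: append fragments column by column.
            for parts, c in zip(records[-1], cells):
                if c:
                    parts.append(c)
    return records


def join_fragments(parts):
    """Glue a wrapped email tail onto the fragment before it (heuristic)."""
    joined = []
    for part in parts:
        prev = joined[-1] if joined else ""
        if "@" in prev and " " not in prev and EMAIL_TAIL.fullmatch(part):
            joined[-1] = prev + part
        else:
            joined.append(part)
    return joined


def clean_table(text):
    """Return one list per entry, each holding a list of lines per column."""
    return [[join_fragments(cell) for cell in record]
            for record in merge_continuations(parse_rows(text))]


if __name__ == "__main__":
    for record in clean_table(SAMPLE):
        print(record)
```

This only addresses the split-email case. It keeps address lines separate, so
a stray "549110" line would still need its own rule, and a bare token such as
"555-1234" right after an email address would be glued on by mistake.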

I wonder if there's a clean fix to this sort of problem?

(And why a max of 138 chars?  That's a rather bizarre number.)


--
   _'
   O
 <:#/>  John Chambers
   +   <jc-8FIgwK2HfyJMuWfdjsoA/w at public.gmane.org>
  /#\  <jc1742-Re5JQEeQqe8AvxtiuMwx3w at public.gmane.org>
  | |




