Good Word doc -> plain text conversion

Gordon Marx gcmarx-Re5JQEeQqe8AvxtiuMwx3w at public.gmane.org
Sun Sep 19 20:01:01 EDT 2010


You know who's totally psyched about this email? Susan Cutright and
Rebecca Sniderman...

On Sun, Sep 19, 2010 at 8:01 PM, <jc-8FIgwK2HfyJMuWfdjsoA/w at public.gmane.org> wrote:
> Dan Ritter wrote:
> | antiword is the usual candidate. Every one of Google's first ten
> | results for that are relevant.
>
> Yeah, I thought of that, too, but I was hoping there might be something that
> does a better job. In one of my current sample .doc files, for example,
> antiword produces the curious table entry:
>
> | CUTRIGHT, Susan        |11 Arlington Road    | (781)209-9877          |
> |                        |Waltham, MA  02453   |susan.cutright at ASPENTECH|
> |                        |                     |.com                    |
>
> Note the "wrapping" of the email address, with the ".com" on a separate line.
> When Word displays this on a Windows screen, this wrapping doesn't happen.
> The 3rd column strings are actually centered, and the email address is
> whole.
>
> After a bit of exploring, I found that the -w option works to get a wider
> "page" size, and this entry actually works, but others in the file don't.
> When I tried things like "antiword -w 200 <file>", it decreases the width
> to 138, which seems to be the widest "page" that it believes possible. So
> later in the same file, I get the following 138-char-wide chunk:
>
> |SNIDERMAN, Rebecca                            |MB 1794 Brandeis University P O Box      |             rsnider-1FONPbNgvBv2fBVCVOL8/A at public.gmane.org              |
> |                                              |549110                                   |                                               |
> |                                              |Brandeis University                      |                                               |
> |                                              |Waltham, MA  02454-9110                  |                                               |
>
> Note the bizarre 4-line address, with just "549110" on the second line. Of
> course, the sensible thing would be to remove the first "Brandeis University"
> from the address, but that's what's in the file, and there are other entries
> with quite long addresses. I tried to write a perl parser that would handle
> all the entries in this file and a couple of others, and after an afternoon
> of hacking at it, I still haven't quite succeeded. Such spurious line
> wrapping, including things like splitting ".net" into ".n" and "et" in one
> case, can be one of the trickier kinds of damage to fix.
>
> I wonder if there's a clean fix to this sort of problem?
>
> (And why a max of 138 chars?  That's a rather bizarre number.)
>
>
> --
>   _'
>   O
>  <:#/>  John Chambers
>   +   <jc-8FIgwK2HfyJMuWfdjsoA/w at public.gmane.org>
>  /#\  <jc1742-Re5JQEeQqe8AvxtiuMwx3w at public.gmane.org>
>  | |
>
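
For the wrapped-cell problem described in the quoted message, one possible
approach is to treat any table row whose first cell is empty as a
continuation of the row above, and to collect the fragments column by column.
What follows is a minimal Python sketch under that assumption, not a tested
fix for real antiword output. The function names, the sample rows, and the
rule of gluing on any fragment that starts with "." are all made up for
illustration.

```python
# Hypothetical sample shaped like the table in the quoted message.
SAMPLE = [
    "|DOE, Jane              |1 Example Road       | (555)555-0100          |",
    "|                       |Anytown, MA 00000    |jane.doe at EXAMPLE     |",
    "|                       |                     |.com                    |",
]


def split_row(line):
    """Split one '|a|b|c|' table line into stripped cell strings."""
    body = line.strip()
    if body.startswith("|"):
        body = body[1:]
    if body.endswith("|"):
        body = body[:-1]
    return [cell.strip() for cell in body.split("|")]


def merge_wrapped_rows(lines):
    """Group table lines into records, one list of fragments per column.

    Assumption: a row whose first cell is empty continues the previous record.
    """
    records = []
    for line in lines:
        if not line.strip().startswith("|"):
            continue  # skip anything that is not a table row
        cells = split_row(line)
        if cells[0] or not records:
            # A non-empty first cell starts a new record.
            records.append([[cell] if cell else [] for cell in cells])
        else:
            # Continuation row: append each non-empty cell to its column.
            for column, cell in zip(records[-1], cells):
                if cell:
                    column.append(cell)
    return records


def join_fragments(fragments):
    """Join fragments with spaces, except glue on those starting with '.'."""
    text = ""
    for fragment in fragments:
        if text and not fragment.startswith("."):
            text += " "
        text += fragment
    return text


if __name__ == "__main__":
    for record in merge_wrapped_rows(SAMPLE):
        print([join_fragments(column) for column in record])
```

The "." rule would rejoin a ".com" that landed on its own line, but it would
not repair a split like ".n" / "et", and nothing here can tell a wrapped
address line from a genuine line break in the original cell. Those cases
would still need per-file judgment.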
