
BLU Discuss list archive



Good Word doc -> plain text conversion



You know who's totally psyched about this email? Susan Cutright and
Rebecca Sniderman...

On Sun, Sep 19, 2010 at 8:01 PM,  <jc-8FIgwK2HfyJMuWfdjsoA/w at public.gmane.org> wrote:
> Dan Ritter wrote:
> | antiword is the usual candidate. Every one of Google's first ten
> | results for that are relevant.
>
> Yeah, I thought of that, too, but I was hoping there might be something that
> does a better job. In one of my current sample .doc files, for example,
> antiword produces the curious table entry:
>
> | CUTRIGHT, Susan        |11 Arlington Road    | (781)209-9877          |
> |                        |Waltham, MA  02453   |susan.cutright at ASPENTECH|
> |                        |                     |.com                    |
>
> Note the "wrapping" of the email address, with the ".com" on a separate line.
> When Word displays this on a Windows screen, this wrapping doesn't happen.
> The 3rd column strings are actually centered, and the email address is
> whole.
>
> After a bit of exploring, I found that the -w option works to get a wider
> "page" size, and this entry actually works, but others in the file don't.
> When I tried things like "antiword -w 200 <file>", it decreases the width
> to 138, which seems to be the widest "page" that it believes possible. So
> later in the same file, I get the following 138-char-wide chunk:
>
> |SNIDERMAN, Rebecca                            |MB 1794 Brandeis University P O Box      |             rsnider-1FONPbNgvBv2fBVCVOL8/A at public.gmane.org              |
> |                                              |549110                                   |                                               |
> |                                              |Brandeis University                      |                                               |
> |                                              |Waltham, MA  02454-9110                  |                                               |
>
> Note the bizarre 4-line address, with just "549110" on the second line. Of
> course, the sensible thing would be to remove the first "Brandeis University"
> from the address, but that's what's in the file, and there are other entries
> with quite long addresses. I tried to write a perl parser that would handle
> all the entries in this file and a couple of others, and after an afternoon
> of hacking at it, I still haven't quite succeeded. Such spurious line
> wrapping, including things like splitting ".net" into ".n" and "et" in one
> case, can be one of the trickier kinds of damage to fix.
>
> I wonder if there's a clean fix to this sort of problem?
>
> (And why a max of 138 chars? That's a rather bizarre number.)
>
>
> --
>   _'
>   O
>  <:#/>  John Chambers
>   +   <jc-8FIgwK2HfyJMuWfdjsoA/w at public.gmane.org>
>  /#\  <jc1742-Re5JQEeQqe8AvxtiuMwx3w at public.gmane.org>
>  | |
>
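
For what it's worth, here is a rough sketch of one way to undo that kind of
wrapping after the fact. It is written in Python rather than perl, and it is
not anything antiword provides. It rests on three assumptions that may not
hold for every file: each table row is a line of |-separated cells, a row
whose first cell is blank continues the previous record, and the email
address is the last item in its cell, so every fragment from the one
containing "@" onward belongs to the address. It also assumes the raw
antiword output still has a real "@" in it, unlike the addresses as the
archive shows them. The function names and the sample data are made up for
illustration.

```python
def split_row(line):
    """Split one '|a|b|c|' table line into a list of stripped cell strings."""
    s = line.strip()
    if s.startswith("|"):
        s = s[1:]
    if s.endswith("|"):
        s = s[:-1]
    return [cell.strip() for cell in s.split("|")]


def group_records(lines):
    """Group table lines into records.

    Each record is a list of columns, and each column is the list of
    non-empty fragments found in that column. A line whose first cell is
    blank is treated as a continuation of the previous record (assumption).
    """
    records = []
    for line in lines:
        if not line.lstrip().startswith("|"):
            continue  # skip anything that is not a table row
        cells = split_row(line)
        if cells[0] or not records:
            # A non-blank first cell starts a new record.
            records.append([[c] if c else [] for c in cells])
        else:
            # Continuation row: append each non-empty fragment to its column.
            for column, fragment in zip(records[-1], cells):
                if fragment:
                    column.append(fragment)
    return records


def join_email(fragments):
    """Return (other_fragments, email) for one cell's fragments.

    Everything from the first fragment containing '@' onward is glued
    together with no separator, which repairs splits such as
    'user@EXAMPLE' + '.com' or '...example.n' + 'et'. If no fragment
    contains '@', the email is None.
    """
    for i, fragment in enumerate(fragments):
        if "@" in fragment:
            return fragments[:i], "".join(fragments[i:])
    return list(fragments), None


# Hypothetical sample laid out like the antiword output quoted above.
sample = [
    "| DOE, Jane     |1 Example Road  | (555)555-0100    |",
    "|               |Anytown, MA     |jane.doe@EXAMPLE  |",
    "|               |                |.com              |",
]

for name, address, contact in group_records(sample):
    others, email = join_email(contact)
    print(" ".join(name), "/", ", ".join(address), "/", others, "/", email)
```

Keeping the fragments as lists until the last step leaves the choice of
separator to each column: address lines can be joined with ", " while the
pieces of an email address get no separator at all. A first-column entry
that itself wraps onto a second line would be mistaken for a new record,
so that case needs separate handling.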







BLU is a member of BostonUserGroups
We also thank MIT for the use of their facilities.




Boston Linux & Unix / webmaster@blu.org