
BLU Discuss list archive



Good Word doc -> plain text conversion



Dan Ritter wrote:
| antiword is the usual candidate. Every one of Google's first ten
| results for that are relevant.

Yeah, I thought of that, too, but I was hoping there might be something that
does a better job. In one of my current sample .doc files, for example,
antiword produces the curious table entry:

| CUTRIGHT, Susan        |11 Arlington Road    | (781)209-9877          |
|                        |Waltham, MA  02453   |susan.cutright at ASPENTECH|
|                        |                     |.com                    |

Note the "wrapping" of the email address, with the ".com" on a separate line.
When Word displays this on a Windows screen, this wrapping doesn't happen.
The 3rd column strings are actually centered, and the email address is
whole.

After a bit of exploring, I found that the -w option gives a wider "page",
and with it this entry comes out whole, but others in the file still don't.
When I tried things like "antiword -w 200 <file>", it reduced the width
to 138, which seems to be the widest "page" that it believes possible. So
later in the same file, I get the following 138-char-wide chunk:

|SNIDERMAN, Rebecca                            |MB 1794 Brandeis University P O Box      |             rsnider-1FONPbNgvBv2fBVCVOL8/A at public.gmane.org              |
|                                              |549110                                   |                                               |
|                                              |Brandeis University                      |                                               |
|                                              |Waltham, MA  02454-9110                  |                                               |

Note the bizarre 4-line address, with just "549110" on the second line. Of
course, the sensible thing would be to remove the first "Brandeis University"
from the address, but that's what's in the file, and there are other entries
with quite long addresses. I tried to write a perl parser that would handle
all the entries in this file and a couple of others, and after an afternoon
of hacking at it, I still haven't quite succeeded. Such spurious line
wrapping, including things like splitting ".net" into ".n" and "et" in one
case, can be one of the trickier kinds of damage to fix.
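
For illustration, here is a minimal sketch of the kind of merge such a parser
has to do, written in Python rather than perl. It is not the parser described
above. It rests on two assumptions that may not hold for every file: a row
whose first column is empty continues the previous entry, and a fragment that
fills its cell edge to edge (no padding) was cut mid-token, so the next
fragment is glued on without a space. The names split_row and
merge_wrapped_rows are made up for this example.

```python
def split_row(line):
    """Return the raw (unstripped) cells of a '|'-delimited row, or None."""
    line = line.rstrip()
    if not (line.startswith("|") and line.endswith("|")):
        return None  # not a table row
    return line[1:-1].split("|")


def merge_wrapped_rows(lines):
    """Fold continuation rows into the entry they belong to.

    Assumption: an empty first column marks a continuation row.
    Assumption: a fragment with no padding in its cell was hard-wrapped
    mid-token, so the following fragment is appended without a space.
    """
    records = []  # one list of cell strings per logical entry
    filled = []   # per column: did the latest fragment fill its cell?
    for line in lines:
        raw = split_row(line)
        if raw is None:
            continue
        text = [cell.strip() for cell in raw]
        continuation = (
            bool(records) and not text[0] and len(text) == len(records[-1])
        )
        if not continuation:
            records.append(text)
            filled = [bool(t) and len(t) == len(c) for t, c in zip(text, raw)]
            continue
        current = records[-1]
        for i, (t, c) in enumerate(zip(text, raw)):
            if not t:
                continue  # nothing wrapped into this column on this row
            sep = "" if filled[i] else " "
            current[i] = current[i] + sep + t if current[i] else t
            filled[i] = len(t) == len(c)
    return records
```

Feeding it the lines of antiword's output (for example, the lines of a file
saved from "antiword -w 200 <file>") gives one list of cell strings per entry.
The padding test is only a heuristic: a fragment that happens to be exactly as
wide as its column would be glued to the next one even if a space belonged
there, and genuinely separate address lines are joined with a single space.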

I wonder if there's a clean fix to this sort of problem?

(And why a max of 138 chars?  That's a rather bizarre number.)


--
   _'
   O
 <:#/>  John Chambers
   +   <jc-8FIgwK2HfyJMuWfdjsoA/w at public.gmane.org>
  /#\  <jc1742-Re5JQEeQqe8AvxtiuMwx3w at public.gmane.org>
  | |





