[Discuss] Please help with a sed script

E. William Horne malassimilation at gmail.com
Tue May 25 18:35:24 EDT 2021


THANK YOU for the scripts.

I apologize for not writing my request more clearly:

 1. As with the discuss list, users who subscribe to the Telecom Digest
    mailing list can choose to receive either each email that is sent to
    the mailing list, or a "Digest" version, with all the emails for
    a day concatenated into a single "Digest" email. I receive a copy of
    The Telecom Digest's "Digest" edition, which is sent to me from a
    SYMPA email reflector at iecc.com in New York. The email message I
    used to test the scripts quoted here is at
    http://telecom.csail.mit.edu/archives/back.issues/recent.single.issues/test-e145.txt
    - it is a verbatim copy of the email, taken from my mbox after it
    arrived, with a few edits to help prevent spam.
 2. Since some viewers prefer to get the Telecom Digest online, I
    prepare an HTML version of the daily digest email. To do that, I've
    been doing a lot of edits by hand, and I need a more automated
    method. To that end, I'm asking for help to write either a sed or
    awk or whatever-works script, which will convert the daily "Digest"
    email into an HTML page with that day's messages on it.
      * It will write a table-of-contents with the subjects from all the
        emails in it.
      * The email User ID's and addresses are to be removed before
        outputting the table.
      * There are other edits, but they are not nearly as hard as the Table
        of Contents, so I'm asking for help with that.

Today's Telecom Digest Table of Contents looks like this:

Table of contents:

* 1 - Re: [telecom] Cell phone bills too high? Here are some that start at
   just   $10 a month - "John Levine" <johnl at taugh.com>
* 2 - Re: [telecom] Cell phone bills too high? Here are some that start at
   just $10  a month - Bill Horne <malQassRimiMlation at gmail.com>
* 3 - [telecom] Opinion: CTL is going downhill fast - Moderator
<telecomdigestsubmissions at remove-this.telecom-digest.org>

I tried the sed script:

sed 's|^\* [0-9]* - \(.*\)\[telecom\] \(.*\) - .*$|<tr><td>\1\2</td></tr>|g' test.txt >t1.txt

The result was:

* 1 - Re: [telecom] Cell phone bills too high? Here are some that start at
   just   $10 a month - "John Levine" <johnl at taugh.com>
* 2 - Re: [telecom] Cell phone bills too high? Here are some that start at
   just $10  a month - Bill Horne <malQassRimiMlation at gmail.com>
<tr><td>Opinion: CTL is going downhill fast</td></tr>
<telecomdigestsubmissions at remove-this.telecom-digest.org>

So, it looks like the sed option is going to need some refinement. ;-)

 1. I could try testing if the line started with "* 0-9 - [telecom]" and
    ended with ">", and then figure out if there were extra hyphens in it
    and edit it using the last one as a delimiter.
 2. If the line started with "* 0-9 - [telecom]", but didn't end with
    ">", then I'd try to write it out to the "hold" buffer, and read in
    the next line to see if /that/ /line/ ended with ">", and if it did,
    I'd like to combine the two lines in the hold area, move the hold
    area to the pattern space, and edit it there as if it were a single
    line.
 3. I haven't thought of how to deal with three-line entries yet. I need
    a bigger thinking cap for this.
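
The hold-buffer idea in items 2 and 3 can be sketched in sed as follows.
This is only a sketch under assumptions: it is fed just the
table-of-contents lines, every entry begins with "* N - " in column one,
and any other line continues the previous entry, so two-line and
three-line entries are handled the same way. The function names toc_join
and toc_rows are made up for illustration. The substitution in toc_rows is
the one quoted above, with the "g" flag dropped and the digit pattern
tightened to require at least one digit.

```shell
#!/bin/sh
# toc_join (hypothetical name): merge each entry and its continuation
# lines into one line. Assumes the input holds only table-of-contents lines.
toc_join() {
    sed -n '
    # A line that starts with "* N - " opens a new entry.
    /^\* [0-9][0-9]* - /{
        # Swap: park the new line in the hold space, fetch the previous entry.
        x
        # Nothing is held before the first entry, so print only if non-empty.
        /./{
            s/\n */ /g
            p
        }
        # If the new line is also the last line, print it right away.
        ${
            x
            p
        }
        d
    }
    # Any other line is a continuation: append it to the held entry.
    H
    # At end of input, fetch the held entry, join its lines, and print it.
    ${
        x
        s/\n */ /g
        p
    }
    '
}

# toc_rows (hypothetical name): turn each joined entry into a table row,
# dropping the "* N - " prefix, the [telecom] tag, and the sender.
toc_rows() {
    sed 's|^\* [0-9][0-9]* - \(.*\)\[telecom\] \(.*\) - .*$|<tr><td>\1\2</td></tr>|'
}
```

With the table of contents saved in a file (the name toc.txt is an
assumption), it would be run as: toc_join <toc.txt | toc_rows >t1.txt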

I then tried the awk script:

awk  '/^   \* [0-9]* - .*\[telecom\]/{if (NR>1) print ""} {printf $0} END{print ""}' <test-e145.txt

and got this (edited for brevity) output:

: 5783Lines: 144telecom digest Tue, 25 May 2021Table of contents:* 1 - Re: [telecom] Cell phone bills too high? Here are some that start at  just   $10 a month - "John Levine" <johnl at taugh.com>* 2 - Re: [telecom] Cell phone bills too high? Here are some that start at  just $10  a month - Bill Horne <malQassRimiMlation at gmail.com>* 3 - [telecom] Opinion: CTL is going downhill fast - Moderator  <telecomdigestsubmissions at remove-this.telecom-digest.org>----------------------------------------------------------------------

Which is better in a way: if awk can produce continuous output, without newline characters, then it can probably edit the input as if it were one continuous line, which would make things easier. I'll have to find the awk manual and do more studying.
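
A note on the awk attempt, offered as a sketch rather than a solution tested on every digest: the output above came out as one unbroken line, which suggests the pattern (it expects three spaces before each "*") never matched, so awk simply printed every line without its newline. Also, printf $0 treats the line as a format string, so a "%" in a subject could garble the output; printf "%s", $0 avoids that. The version below does the joining and the editing in awk alone. It assumes the table starts after a "Table of contents:" line and ends at the row of dashes, as in the output above, and that entries may or may not be indented. The function name toc_to_rows is made up for illustration. Unlike the sed pipeline, it removes the first "[telecom] " in each entry, not the last.

```shell
#!/bin/sh
# toc_to_rows (hypothetical name): read a whole digest email on stdin and
# print one <tr><td>subject</td></tr> line per table-of-contents entry.
toc_to_rows() {
    awk '
    # Emit the entry collected so far as one table row, then reset it.
    function flush(   s) {
        if (entry == "") return
        s = entry
        gsub(/[ \t]+/, " ", s)             # squeeze runs of whitespace
        sub(/^ ?\* [0-9]+ - /, "", s)      # drop the "* N - " prefix
        sub(/\[telecom\] /, "", s)         # drop the first list tag
        if (match(s, /^.* - /))            # longest match ends at the LAST " - "
            s = substr(s, 1, RLENGTH - 3)  # keep the subject, drop the sender
        printf "<tr><td>%s</td></tr>\n", s
        entry = ""
    }
    /^Table of contents:/ { in_toc = 1; next }               # table starts here
    !in_toc               { next }                           # ignore the rest of the email
    /^-----/              { flush(); in_toc = 0; next }      # row of dashes ends the table
    /^[ \t]*$/            { next }                           # skip blank lines
    /^ *\* [0-9]+ - /     { flush(); entry = $0; next }      # a new entry begins
    entry != ""           { entry = entry " " $0 }           # continuation line
    END                   { flush() }
    '
}
```

It would be run against the saved digest as: toc_to_rows <test-e145.txt >t1.txt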

My thanks to Mr. Galperin for his help. I need all I can get!

Bill Horne


On 5/25/2021 9:54 AM, Gregory Galperin wrote:
> awk  '/^   \* [0-9]* - .*\[telecom\]/{if (NR>1) print ""} {printf $0} END{print ""}' | \
> sed 's|^   \* [0-9]* - \(.*\)\[telecom\] \(.*\) - .*$|<tr><td>\1\2</td></tr>|g'
>
> notes:
>   * I assumed the 3 spaces before the * were part of the data (rather than
>     just formatting by you in this particular email)
>   * other than that, whitespace is considered unimportant and left alone,
>     since html doesn't care.  if you want to squeeze redundant whitespace,
>     | tr -s ' '
>   * hyphens (and even the string " - ") can be in the subject, no problem
>   * if the string " - " is in the free text part of the email name, then any
>     part of that free text before the " - " is considered as part of the subject
>   * if the subject has the string [telecom] in it a second time or more,
>     only the last [telecom] gets eaten -- so e.g. a subject
> 	Re: [telecom] Why do all subject lines have [telecom] at the front?
>     becomes
> 	Re: [telecom] Why do all subject lines have at the front?
>   * the string [telecom] can be in the email field, no problem
>     (but note that if the email field has both [telecom] and a " - " after it,
>      the " - " makes everything before that show up in the subject, and then
>      in the subject only the last [telecom] before the " - " gets eaten)
>   * on the off chance the wrapping breaks a subject so that the continuation
>     starts with a *, has one space and then a number and then " - " and has
>     the string [telecom] somewhere on that line, this will consider that
>     continuation to instead be a new message.
>
> maybe try it on a couple months of digests and look through the results?
>
> --grg
>
>
> On Tue, May 25, 2021 at 02:21:53AM -0400, Bill Horne wrote:
>> Thanks for reading this: I appreciate your time.
>>
>> I'm the Moderator of The Telecom Digest, which is the oldest e-zine on the
>> Internet.
>>
>> The readers send in pointers to articles of interest, and each day, other
>> readers who subscribe with the "digest" option receive an email with all
>> the previous day's stories.
>>
>> Here's the table-of-contents from a typical day:
>>
>>     * 1 - [telecom] Can robocalls be tracked? - "bob prohaska"
>>     <bp at remove-this.www.zefox.net>
>>     * 2 - Re: [telecom] Can robocalls be tracked? - Bill Horne
>>        <malQRMassimilation at gmail.com>
>>     * 3 - [telecom] Verizon Media debuts ad-targeting solution without
>>     identifiers
>>        - Moderator<telecomdigestsubmissions at remove-this.telecom-digest.org>
>>
>> And here's what I'd like to change it to, using (if possible) sed:
>>
>>         (tr)(td)Can robocalls be tracked?(/td)(/tr)
>>         (tr)(td)Re: Can robocalls be tracked?(/td)(/tr)
>>         (tr)(td)Verizon Media debuts ad-targeting solution without
>>         identifiers(/td)(/tr)
>>
>>         ("less-than" and "greater-than" symbols have been changed to
>>         parens here for obvious reasons.)
>>
>> Things to note:
>>
>> 1. The Subjects lines vary in length, and may contain hyphens.
>> 2. The name and email of the contributor is also published with the
>>     actual post, further on in each digest, so it doesn't have to appear
>>     in the Table of Contents.
>> 3. The "m" option of sed, which the manual says will do a multi-line
>>     "s" command, doesn't appear to work on the OS I'm using, which is
>>     Ubuntu 16 LTS.
>>
>> Up until now, I've been doing this change every day, with emacs macros and
>> the rest by-hand. I want to automate a lot more of the daily work, so I'm
>> hoping that there's a way to get Linux sed to do that. I don't need sed per
>> se: if awk or some other utility would be a better choice, please tell me
>> about that possible solution instead.
>>
>> Thank you again.
>>
>> Bill
>>
>> -- 
>> Bill Horne
>>
>> _______________________________________________
>> Discuss mailing list
>> Discuss at lists.blu.org
>> http://lists.blu.org/mailman/listinfo/discuss

