[Discuss] Please help with a sed script

Wed May 26 01:56:33 EDT 2021

E. William Horne wrote:
> I'm asking for help to write either a sed or
> awk or whatever-works script, which will convert the daily "Digest"
> email into an HTML page with that day's messages on it.

I agree with others that you'd avoid a lot of corner cases if you could
capture this info "upstream" rather than parsing it from an already
formatted digest, but I assume you have your reasons for preferring to
parse the digest.

> Today's Telecom Digest Table of Contents looks like this:
> 
> Table of contents:
> 
> * 1 - Re: [telecom] Cell phone bills too high? Here are some that start at
>   just   $10 a month - "John Levine" <johnl at taugh.com>
> * 2 - Re: [telecom] Cell phone bills too high? Here are some that start at
>   just $10  a month - Bill Horne <malQassRimiMlation at gmail.com>
> * 3 - [telecom] Opinion: CTL is going downhill fast - Moderator
> <telecomdigestsubmissions at remove-this.telecom-digest.org>

% ./toc.pl test-e145.txt
<tr><td>Re: Cell phone bills too high? Here are some that start at just $10 a month</td></tr>
<tr><td>Re: Cell phone bills too high? Here are some that start at just $10 a month</td></tr>
<tr><td>Opinion: CTL is going downhill fast</td></tr>

or...

% curl -o - \
  http://telecom.csail.mit.edu/archives/back.issues/recent.single.issues/test-e145.txt \
  | ./toc.pl

> So, it looks like the sed option is going to need some refinement. ;-)
> 
> 1. I could try testing if the line started with "* 0-9 - [telecom]" and
>    ended with ">", and then figure out if there were extra hyphen in it
>    and edit it using the last one as a delimiter.
> 2. If the line started with "* 0-9 - [telecom]", but didn't end with
>    ">", then I'd try to write it out to the "hold" buffer, and read in
>    the next line to see if /that/ /line/ ended with ">", and if it did,
>    I'd like to combine the two lines in the hold area, move the hold
>    area to the pattern space, and edit it there as if it were a single
>    line.
> 3. I haven't thought of how to deal with three-line entries yet. I need
>    a bigger thinking cap for this.

A Perl solution is below. I think it addresses the items you list above.

It's written for clarity and ease of modification rather than brevity.
For example, I didn't attempt to do this as a "one liner", and instead
have used multiple statements. Similarly the process of unwrapping the
per-message TOC entries is handled in one loop, while another deals with
reformatting the subjects.

The text-matching patterns are written to be resilient to minor
formatting changes: they never look for a single literal space or a
single digit, and instead match runs of whitespace and one or more
digits. In other circumstances I'd allow for optional whitespace before
the "*" character that delimits TOC entries, but in this case requiring
it to appear at the start of the line improves the chances that the
parser won't get tripped up by the delimiter pattern appearing
elsewhere on the line.

Also note the code makes use of the HTML::Entities module (which you can
find in the libhtml-parser-perl package on Ubuntu 16.04) to escape HTML
entities that appear in the subject lines.

The unwrapping code relies on the pattern
<start_of_line>*<whitespace><digits><whitespace>-<whitespace> as the
delimiter for splitting up the existing table of contents into one line
per message. It should handle TOC entries that wrap across an arbitrary
number of lines.
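
To make the unwrapping step concrete, here is a minimal standalone
sketch of just that split. The sample TOC text is made up for
illustration (it is not read from a real digest), and the variable names
are only loosely based on the script at the end of this message.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample: two TOC entries, the first wrapped over two lines.
my $toc_body = join "\n",
    '* 1 - Re: [telecom] Cell phone bills too high? Here are some that start at',
    '  just   $10 a month - "John Levine" <johnl at taugh.com>',
    '* 2 - [telecom] Opinion: CTL is going downhill fast - Moderator',
    '';

# Split on the "* N - " delimiter; /m lets ^ match at the start of every line.
my @entries;
foreach my $chunk ( split /^\*\s+\d+\s+-\s*/m, $toc_body ) {
    next if $chunk eq '';    # split yields an empty field before the first match
    $chunk =~ tr/\n\r//d;    # rejoin wrapped lines by deleting line endings
    push @entries, $chunk;
}

print scalar(@entries), " entries\n";
print "$_\n" for @entries;
```

Each element of @entries then holds one complete "subject - author"
string, however many lines the original entry was wrapped across.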

Initially I was going to strip out all the newlines and process the TOC
as one string, but that loses the start-of-line anchor and makes it
slightly more likely that the delimiter pattern would match inside a
subject line.

The code then iterates over the unwrapped TOC entries and uses a
"greedy" expression to split each one on the last " - " found. From
there it processes the subject portion to collapse repeated whitespace
(not really necessary, as Gregory points out), strip the [telecom] tag,
and escape the HTML entities, and then outputs it as an HTML table row.
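
As a small illustration of the greedy split, the sketch below runs the
same kind of pattern against a single made-up entry whose subject itself
contains " - ". The entry text and the author in it are hypothetical.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical entry whose subject itself contains " - ".
my $entry = '[telecom] AT&T - Verizon merger rumors - Jane Doe <jane at example.com>';

# The first (.*) is greedy: it grabs as much as it can and gives back only
# what is needed, so \s+-\s+ ends up matching the LAST " - " in the entry.
my ($subject, $from) = $entry =~ m/(.*)\s+-\s+(.*)$/;

print "subject: $subject\n";
print "from:    $from\n";
```

The earlier " - " stays inside the subject. The flip side, as noted
elsewhere in this message, is that a " - " in the author's display name
would pull part of that name into the subject.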

Gregory Galperin wrote:
>   * if the string " - " is in the free text part of the email name,
> then any part of that free text before the " - " is considered
> as part of the subject

Likewise.

>   * if the subject has the string [telecom] in it a second time or more,
>     only the last [telecom] gets eaten -- so e.g. a subject
>     Re: [telecom] Why do all subject lines have [telecom] at the front?
>     becomes
>     Re: [telecom] Why do all subject lines have at the front?

My code strips the first occurrence in the subject and ignores the rest.
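
A quick sketch of that behavior, using Gregory's example subject. The
variable names here are only for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $subject = 'Re: [telecom] Why do all subject lines have [telecom] at the front?';

# Without the /g modifier, s/// replaces only the first match, so any
# later [telecom] in the subject is left alone.
(my $stripped = $subject) =~ s/\[telecom\]\s*//;

print "$stripped\n";
```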

>   * the string [telecom] can be in the email field, no problem
>     (but note that if the email field has both [telecom] and a " - "
>     after it,  the " - " makes everything before that show up in
>     the subject, and then in the subject only the last [telecom] 
>     before the " - " gets eaten)

Likewise.

>   * on the off chance the wrapping breaks a subject so that the
>     continuation starts with a *, has one space and then a number
>     and then " - " and has the string [telecom] somewhere on that
>     line, this will consider that continuation to instead be a
>     new message.

Similar, but [telecom] is presumed to be optional. My code requires that
the "* N - " sequence appear at the start of a line, and given that
continuation lines are always prefixed with several spaces, this
sequence embedded in the subject shouldn't be mistaken for a delimiter,
even if it starts a continued line.
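
Here is a rough sketch of that edge case. The TOC text is contrived for
the purpose (a subject that wraps so the continuation line begins with
"* 2 - " after its leading spaces), and it assumes continuation lines
are always indented, as they are in the digests I looked at.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Contrived sample: the continuation of entry 1 looks like a delimiter,
# but it is indented, so it does not start at column one.
my $toc_body = join "\n",
    '* 1 - [telecom] Odd subject that wraps right before a star and',
    '  * 2 - more subject text - Someone <someone at example.com>',
    '* 2 - [telecom] Next message - Another <another at example.com>',
    '';

my @entries;
foreach my $chunk ( split /^\*\s+\d+\s+-\s*/m, $toc_body ) {
    next if $chunk eq '';    # skip the empty field before the first delimiter
    $chunk =~ tr/\n\r//d;    # rejoin wrapped lines
    push @entries, $chunk;
}

# Only the two unindented "* N - " sequences act as delimiters.
print scalar(@entries), " entries\n";
print "$_\n" for @entries;
```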

 -Tom

-------------------------------------------------------------------------------
#!/usr/bin/perl
use warnings;
use strict;

use HTML::Entities;

# read STDIN in "paragraphs"
$/='';

# grab the first several paragraphs of the message
my $header = <>;
my $title = <>;
my $toc_heading = <>;
my $toc_body = <>;

#print "raw\n$toc_body\n\n";

# process a TOC body consisting of lines like
#
# * 1 - Re: [telecom] Cell phone bills too high? Here are...
#  ...
#  just   $10 a month - "John Levine" <johnl at taugh.com>
# * 2 - Re: [telecom] Cell phone bills too high? Here are...

# undo the per-message line wrapping and strip line numbers/prefix
my $unwrapped_toc = '';
foreach my $toc_line ( split(/^\*\s+\d+\s+-\s*/m, $toc_body) ) {
    next if !$toc_line;
    $toc_line =~ tr/\n\r//d; # strip line endings
    $unwrapped_toc .= $toc_line . "\n";
}
$toc_body = $unwrapped_toc;

# further process lines like
#
# Re: [telecom] Cell phone bills too high... - <email>
# [telecom] Opinion: CTL is going downhill fast - <email>

foreach my $toc_line ( split(/\n/, $toc_body) ) {
    #print $toc_line, "\n";
    chomp($toc_line); # strip newline

    # when splitting the subject from the author's email address a
    # "greedy" expression is used, which splits on the last " - " found
    # (This could be a problem if that string appears in the author's
    # real name.)
    my ($subject, $from) = ($toc_line =~ m/(.*)\s+-\s+(.*)$/);
    next unless defined $subject; # skip lines with no " - " separator

    # collapse repeating whitespace to a single space
    $subject =~ tr/ \t/ /s;
    $subject =~ s/\[telecom\]\s*//; # strip unneeded [telecom] tag

    # escape HTML entities and format output as HTML
    my $html_subject = HTML::Entities::encode_entities($subject);
    print "<tr><td>$html_subject</td></tr>\n";
}
-------------------------------------------------------------------------------

-- 
Tom Metro
The Perl Shop, Newton, MA, USA
"Predictable On-demand Perl Consulting."
http://www.theperlshop.com/