BLU Discuss list archive
[Discuss] Please help with a sed script (Bill Horne)
- Subject: [Discuss] Please help with a sed script (Bill Horne)
- From: worley at alum.mit.edu (Dale R. Worley)
- Date: Tue, 25 May 2021 21:08:36 -0400
- In-reply-to: <mailman.15596.1621946796.7345.discuss@lists.blu.org> (discuss-request@driftwood.blu.org)
> From: Bill Horne <malassimilation at gmail.com>

> Here's the table-of-contents from a typical day:
>
> * 1 - [telecom] Can robocalls be tracked? - "bob prohaska"
> <bp at remove-this.www.zefox.net>
> * 2 - Re: [telecom] Can robocalls be tracked? - Bill Horne
> ? <malQRMassimilation at gmail.com>
> * 3 - [telecom] Verizon Media debuts ad-targeting solution without
> identifiers
> ? - Moderator <telecomdigestsubmissions at remove-this.telecom-digest.org>

First off, you're not specifying how the line breaks work. Are the
line breaks we see here actually in the ToC text, or are they just an
artifact of how you inserted it into this e-mail message? I ask
because line-breaking is one of the harder things to get sed to
change, so we should be clear about it.

> And here's what I'd like to change it to, using (if possible) sed:
>
> (tr)(td)Can robocalls be tracked?(/td)(/tr)
> (tr)(td)Re: Can robocalls be tracked?(/td)(/tr)
> (tr)(td)Verizon Media debuts ad-targeting solution without
> identifiers(/td)(/tr)
>
> ("less-than" and "greater-than" symbols have been changed to
> parens here for obvious reasons.)

It's not quite clear why, as < and > are transparent in ASCII e-mail,
except for the first column.

> Things to note:
>
> 1. The Subjects lines vary in length, and may contain hyphens.
> 2. The name and email of the contributor is also published with the
> actual post, further on in each digest, so it doesn't have to appear
> in the Table of Contents.
> 3. The "m" option of sed, which the manual says will do a multi-line
> "s" command, doesn't appear to work on the OS I'm using, which is
> Ubuntu 16 LTS.

You should do "sed --version" and report what it says.

The above example suggests that the title is separated from the
contributor by " - ", but you don't say that. The contributor appears
to be optional. And you don't specify whether the sequence " - " may
also appear as part of the title, which means parsing the two apart is
ambiguous.
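To make that ambiguity concrete, here is a minimal sketch in Python (used instead of sed purely for readability). The entry string is hypothetical, not taken from the digest, and split_first and split_last are illustrative names. It only shows that "first separator" and "last separator" are two different rules that give two different titles.

```python
# Hypothetical ToC entry whose title itself contains " - ".
entry = "Cable - DSL comparison - Jane Doe"

def split_first(text):
    """Treat the FIRST " - " as the title/contributor separator."""
    title, _, contributor = text.partition(" - ")
    return title, contributor

def split_last(text):
    """Treat the LAST " - " as the title/contributor separator."""
    title, _, contributor = text.rpartition(" - ")
    return title, contributor

print(split_first(entry))
print(split_last(entry))
```

Neither rule is right until the digest format says which one applies, and neither one by itself handles an entry that has no contributor at all.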
The final "Moderator" line is distinguished how? It appears that item
lines start with "\* [1-9][0-9]* - ". Does the Moderator line start
with " \? - "? That is, how do we distinguish it from a continuation
of the preceding title?

As others have noted, it's likely easier to generate the HTML form you
want from an earlier stage of processing, one where there's a data
structure that rigidly differentiates each title, and ideally,
separates the titles from the contributors.

But if you can't do that, the first step is to really nail down how
you parse this text apart conceptually. After that, it's much easier
to implement the transformation.

Dale
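One way to nail down the parse before writing any sed is to spell the rules out in a few lines of code. The sketch below is in Python for readability and rests on assumptions the thread does not confirm: that an item line matches the pattern quoted above after optional leading whitespace, and that every other line continues the current item. The name group_items and the shortened sample input are illustrative only.

```python
import re

# Assumption: an item line is optional whitespace, "*", a number, then " - ".
ITEM_START = re.compile(r"^\s*\* [1-9][0-9]* - ")

def group_items(lines):
    """Group ToC lines into items: each item-start line opens a new item,
    and any other line is treated as a continuation of the current one."""
    items = []
    for line in lines:
        m = ITEM_START.match(line)
        if m:
            items.append(line[m.end():].strip())
        elif items:
            items[-1] += " " + line.strip()
    return items

toc = [
    '* 1 - [telecom] Can robocalls be tracked? - "bob prohaska"',
    "<bp at remove-this.www.zefox.net>",
    "* 3 - [telecom] Verizon Media debuts ad-targeting solution without",
    "identifiers",
    "? - Moderator <telecomdigestsubmissions at remove-this.telecom-digest.org>",
]
for item in group_items(toc):
    print(item)
```

Under these rules the "? - Moderator" line is folded into the third title along with "identifiers", which is exactly the case that needs a rule of its own before the transformation can be implemented.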