[Discuss] Converting "rich" (MIME) email to plain text

Chuck Anderson cra at WPI.EDU
Wed Feb 17 13:18:35 EST 2016


On Wed, Feb 17, 2016 at 11:39:22AM -0500, Michael Tiernan wrote:
> I'm sure that I'm not the first who tried to find an easy way to
> filter a piece of email so that only the plain text comes out.
> 
> I can find lots of things about going plain to HTML but I've not
> seen anything that allows you to just extract the "Content-Type:
> text/plain" section of an email.
> 
> Any pointers available? I don't want to try and reinvent the
> reinvented wheel.

Here is what I use with Mutt to get lightly formatted text and
unobfuscated links.  It isn't perfect, but it works acceptably about
90% of the time, and it avoids fetching anything from remote links,
which was my primary goal.

>grep mailcap .muttrc
set mailcap_path = ~/.muttmailcap
set mailcap_sanitize

>cat .muttmailcap 
text/html; /home/cra/bin/striphtml.pl; copiousoutput
text/calendar; /home/cra/bin/vcalendar-filter; copiousoutput

>cat ~/bin/striphtml.pl
#!/usr/bin/perl -w
use strict;
use HTML::Strip;
use HTML::LinkExtor;
use HTML::Entities qw/decode_entities/;
use URI::Escape qw/uri_unescape/;
use Encode qw/from_to/;

undef $/;
my $html_text = <ARGV>;

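# Look for a Content-Type header in the input; assume UTF-8 otherwise.
# Capture only the charset token, not trailing parameters or a stray \r.
my $charset = 'UTF-8';
if ($html_text =~ /^content-type:\s*text\/html;\s*charset="?([^\s";]+)/im) {
    $charset = $1;
} else {
    print "no char set\n";
    #print $html_text;
}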

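# Turn line breaks and paragraph starts into newlines before stripping
# tags; also match <br/>, <br /> and tags that carry attributes.
$html_text =~ s/<br\b[^>]*>/\n/gi;
$html_text =~ s/<p\b[^>]*>/\n/gi;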
my $hs = HTML::Strip->new();
my $stripped_text = $hs->parse($html_text);

my $decoded_text = decode_entities($stripped_text);
$decoded_text =~ s/\n\s*\n/\n\n/g;
$decoded_text =~ s/\n\n+/\n\n/g;
$decoded_text =~ s/\240/ /g;
$decoded_text =~ s/\r//g;

#$decoded_text = decode($charset, $decoded_text);
###from_to($decoded_text, $charset, 'UTF-8');

my $hl = HTML::LinkExtor->new();
$hl->parse($html_text);
my @links = $hl->links;

print "Charset: $charset\n";
print "Message:\n\n";
print $decoded_text;

print "\nLinks:\n\n";
# Each entry is [tag, attribute, URL, ...]; print the first attribute/URL pair.
foreach my $link (@links) {
  printf "%-7s %-15s %s\n", $link->[0], $link->[1],
    uri_unescape($link->[2]);
}
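
The script above detects a charset but never applies it, since both
decoding calls are commented out.  One consequence is that the \240
substitution runs on raw bytes and can damage UTF-8 characters whose
encoding contains that byte.  A minimal sketch of decoding first and
cleaning up afterwards, using only the core Encode module, might look
like this.  The helper names decode_body and tidy and the sample
string are made up for illustration; they are not part of the
original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw/find_encoding/;

# Hypothetical helper: turn raw bytes into a Perl character string using
# the named charset, falling back to UTF-8 when the name is missing or
# not known to Encode.
sub decode_body {
    my ($bytes, $charset) = @_;
    my $enc;
    $enc = find_encoding($charset) if defined $charset && length $charset;
    $enc ||= find_encoding('UTF-8');
    return $enc->decode($bytes);
}

# Hypothetical helper: whitespace cleanup that is only safe on decoded text.
sub tidy {
    my ($text) = @_;
    $text =~ s/\x{A0}/ /g;        # non-breaking space -> ordinary space
    $text =~ s/\r//g;             # drop carriage returns
    $text =~ s/\n\s*\n/\n\n/g;    # collapse runs of blank lines
    return $text;
}

# Made-up sample: UTF-8 bytes containing an accented letter and an NBSP.
my $sample = "caf\xC3\xA9\xC2\xA0au lait\r\n\r\n\r\nbye\n";

binmode STDOUT, ':encoding(UTF-8)';
print tidy(decode_body($sample, 'UTF-8'));
```

In a real filter the bytes would come from the input instead of the
sample string, and the charset from whatever detection is in use.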

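For the narrower question of pulling out only the "Content-Type:
text/plain" part of a message, a MIME parser avoids hand-written
boundary handling.  The following is a rough sketch, assuming the
Email::MIME module from CPAN is installed; the helper name
plain_text_parts and the sample message are invented for the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Email::MIME;    # from CPAN, not part of core Perl

# Hypothetical helper: return the decoded body of every text/plain leaf
# part in a raw message, in the order the parts appear.
sub plain_text_parts {
    my ($raw) = @_;
    my @found;
    Email::MIME->new($raw)->walk_parts(sub {
        my ($part) = @_;
        return if $part->subparts;    # skip multipart containers
        # A part without a Content-Type header is treated as text/plain.
        my $type = $part->content_type // 'text/plain';
        push @found, $part->body_str if $type =~ m{^text/plain\b}i;
    });
    return @found;
}

# Made-up sample message with a plain and an HTML alternative.
my $raw = join "\r\n",
    'From: sender@example.com',
    'Subject: sample',
    'MIME-Version: 1.0',
    'Content-Type: multipart/alternative; boundary="b1"',
    '',
    '--b1',
    'Content-Type: text/plain; charset=UTF-8',
    '',
    'Hello plain',
    '--b1',
    'Content-Type: text/html; charset=UTF-8',
    '',
    '<p>Hello html</p>',
    '--b1--',
    '';

binmode STDOUT, ':encoding(UTF-8)';
print for plain_text_parts($raw);
```

This also returns text/plain attachments, so a stricter filter would
need to look at Content-Disposition as well.  A message that has only
an HTML part yields nothing, which is where an HTML-stripping filter
like the one above is still needed.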

