Boston Linux & UNIX was originally founded in 1994 as part of The Boston Computer Society. We meet on the third Wednesday of each month at the Massachusetts Institute of Technology, in Building E51.

BLU Discuss list archive



[Discuss] Converting "rich" (MIME) email to plain text



On Wed, Feb 17, 2016 at 11:39:22AM -0500, Michael Tiernan wrote:
> I'm sure that I'm not the first who tried to find an easy way to
> filter a piece of email so that only the plain text comes out.
> 
> I can find lots of things about going plain to HTML but I've not
> seen anything that allows you to just extract the "Content-Type:
> text/plain" section of an email.
> 
> Any pointers available? I don't want to try and reinvent the
> reinvented wheel.

Here is what I use with Mutt to get lightly formatted text and
unobfuscated links.  It isn't perfect, but it works acceptably about
90% of the time, and it avoids downloading anything from remote links,
which was my primary goal.

>grep mailcap .muttrc
set mailcap_path = ~/.muttmailcap
set mailcap_sanitize

>cat .muttmailcap 
text/html; /home/cra/bin/striphtml.pl; copiousoutput
text/calendar; /home/cra/bin/vcalendar-filter; copiousoutput

>cat ~/bin/striphtml.pl
#!/usr/bin/perl -w
use HTML::Strip;
use HTML::LinkExtor;
use HTML::Entities qw/decode_entities/;
use URI::Escape qw/uri_unescape/;
use Encode qw/from_to/;

# Slurp mode: read the whole input (file arguments or STDIN) in one go.
undef $/;
my $html_text = <ARGV>;

my $charset = 'UTF-8';
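# Look for a charset parameter on a text/html Content-Type header line.
# Capture only the charset token, so surrounding quotes, a trailing
# carriage return, or further parameters are not picked up.
if ($html_text =~ /\ncontent-type:\s+text\/html;\s+charset="?([^\s";]+)/i) {
    $charset = $1;
} else {
    print "no char set\n";
    #print $html_text;
}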

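# Turn line-break and paragraph tags into newlines before stripping,
# including variants such as <br/>, <br /> and <p class="...">.
$html_text =~ s/<br\s*\/?>/\n/gi;
$html_text =~ s/<p\b[^>]*>/\n/gi;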
my $hs = HTML::Strip->new();
my $stripped_text = $hs->parse($html_text);

my $decoded_text = decode_entities($stripped_text);
$decoded_text =~ s/\n\s*\n/\n\n/g;
$decoded_text =~ s/\n\n+/\n\n/g;
$decoded_text =~ s/\240/ /g;
$decoded_text =~ s/\r//g;

#$decoded_text = decode($charset, $decoded_text);
###from_to($decoded_text, $charset, 'UTF-8');

my $hl = HTML::LinkExtor->new();
$hl->parse($html_text);
my @links = $hl->links;

print "Charset: $charset\n";
print "Message:\n\n";
print $decoded_text;

print "\nLinks:\n\n";
foreach my $link (@links) {
  printf "%-7s %-15s %s\n", $link->[0], $link->[1],
    uri_unescape($link->[2]);
}
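
The script above detects a charset but leaves its conversion step
commented out.  As a rough sketch only, here is one way that step could
look using the core Encode module.  The decode_body helper and the
sample bytes are hypothetical and not part of the script above, and the
fallback to UTF-8 for unrecognized charset names is an assumption.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw/encode find_encoding/;

# Hypothetical helper: turn raw bytes in the given charset into Perl
# characters. If Encode does not recognize the charset name, fall back
# to UTF-8 (an assumption made for this sketch).
sub decode_body {
    my ($charset, $bytes) = @_;
    my $enc = find_encoding($charset) || find_encoding('UTF-8');
    return $enc->decode($bytes);
}

# "caf\xe9" is the word "café" as ISO-8859-1 bytes.
my $text = decode_body('ISO-8859-1', "caf\xe9");

# Encode back to UTF-8 bytes before printing to a UTF-8 terminal.
print encode('UTF-8', $text), "\n";
```

In the script, something like this would run on the raw input before the
entity decoding, so that the later substitutions see characters rather
than bytes.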



BLU is a member of BostonUserGroups
We also thank MIT for the use of their facilities.




Boston Linux & Unix / webmaster@blu.org