Japanese Mail Archive Character-Set Transcoder

I have no idea if this will be helpful to anyone else, but I’ll throw it out there for the search engines to pick up, just in case. I wrote a Perl script that takes a Sendmail mbox archive of email messages and transcodes the text of every message body, along with the Subject and From headers, to UTF-8. This script may be had here. It reads the archive on standard input and writes the transcoded archive to standard output.

I’m subscribed to a few Japanese-language mailing lists (well, more accurately, one Japanese-language mailing list, a daily mail of the Slashdot Japan headlines, and Google Alerts for wget and tmux). The idea was to get Japanese practice by regularly reading Japanese content on subjects I’m interested in.

The problem is, I just don’t read it when it comes in. The best time for me to practice Japanese is on the train, during my work commute, which is why I have a Kindle 3: I can browse Japanese websites on the train. So I need these mails to be web-accessible. No problem, I can use a mail web archive tool like hypermail.

But hypermail doesn’t like dealing with mbox files whose messages are in various incompatible encodings; some of my mail arrives in UTF-8 (Unicode), and some in ISO-2022-JP (a popular encoding for Japanese-language text). Hypermail also doesn’t deal with encoded characters in the mail headers, and I didn’t know which character encoding to configure Apache to announce to the browser, because it differed from one mail to the next.
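To make the mismatch concrete, here is a small Python illustration (not part of my Perl script; the sample string is just something I picked). It encodes the same text both ways and then reads the ISO-2022-JP bytes as if they were UTF-8, which is roughly what a browser does when the server announces the wrong charset.

```python
# Illustration only: the same Japanese text as raw bytes in two encodings.
text = "日本語"

utf8_bytes = text.encode("utf-8")
jis_bytes = text.encode("iso2022_jp")

# ISO-2022-JP is a 7-bit encoding that switches character sets with escape
# sequences. Its bytes are all plain ASCII, so decoding them as UTF-8 does
# not fail; it silently yields escape codes and unrelated ASCII characters.
misread = jis_bytes.decode("utf-8")

print(repr(misread))                            # garbage, not the original text
print(jis_bytes.decode("iso2022_jp") == text)   # the right codec round-trips
```

Because the wrong decoding raises no error, nothing flags the problem automatically; the page just shows junk.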

So I wrote this transcoder tool in Perl. It scans through all the mails, decodes the Subject and From headers (it currently leaves the other headers alone), and transcodes every body that is in ISO-2022-JP (or any other non-UTF-8 encoding) to UTF-8. That way all the messages use the same encoding, and I just configure Apache to use UTF-8 for all of them. The best part is that, now that the actual content and the server-specified character encoding agree consistently, I can use online tools like Hiragana Megane to process this web mail archive and automatically provide pronunciations for words I’m not very familiar with.
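For anyone who wants the gist without reading Perl, the two steps above can be sketched in Python using only the standard library. This is a rough equivalent under my own assumptions, not a port of the script: the function names `decode_header_value` and `body_as_text` are made up for illustration, the sample message is built on the spot, and it only handles single-part text messages.

```python
from email import message_from_bytes
from email.header import Header, decode_header


def decode_header_value(raw):
    """Decode an RFC 2047 encoded header value to a Unicode string."""
    parts = []
    for chunk, charset in decode_header(raw):
        if isinstance(chunk, bytes):
            # Unencoded fragments come back with charset None; treat as ASCII.
            chunk = chunk.decode(charset or "ascii", errors="replace")
        parts.append(chunk)
    return "".join(parts)


def body_as_text(msg):
    """Return the body of a non-multipart message as Unicode text."""
    # Assumption: fall back to UTF-8 when the message declares no charset.
    charset = msg.get_content_charset() or "utf-8"
    payload = msg.get_payload(decode=True)  # undoes base64/quoted-printable
    return payload.decode(charset, errors="replace")


# Build a small ISO-2022-JP sample message (hypothetical content).
subject = Header("日本語のテスト", "iso-2022-jp").encode()
raw = (
    "Subject: " + subject + "\n"
    "Content-Type: text/plain; charset=ISO-2022-JP\n"
    "\n"
).encode("ascii") + "こんにちは\n".encode("iso2022_jp")

msg = message_from_bytes(raw)
print(decode_header_value(msg["Subject"]))
print(body_as_text(msg))
```

Once both pieces are ordinary Unicode strings, writing them back out as UTF-8 (and updating the declared charset to match) is the easy part. A real archive would also need handling for multipart messages and for charset names the codec library doesn’t recognize, which this sketch skips.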

The script requires the following Perl modules: MIME::Tools (for parsing email format), Mail::Mbox::MessageParser (for parsing mbox), and Text::Iconv (to handle the transcoding).