February | 2011 | micah.cowan.name

Alright, I’ve been working my ass off lately on this project, which I haven’t wanted to say a lot about until I had something reasonable to show for it. I now feel that I’ve reached that point (barely, and depending on your point of view).

Niwt is a project that aims to (eventually) reproduce most of the functionality of GNU Wget (plus some additions), but with a radically different design philosophy: it is built entirely around Unix pipelines, with facilities to easily swap out or extend every existing piece of functionality with an alternative (or additional) program that offers equivalent (or improved) functionality. I expect this to result in a big trade-off: Wget enjoys relative efficiency, lower resource consumption, and portability to other systems (which Niwt certainly will not), while Niwt offers extreme and relatively easy customization. If a highly customizable tool is what you need, Niwt may (when it’s finished) fit your needs; if efficiency and general leanness are what is called for, it most likely will not.
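To make the "swap out any piece" idea a bit more concrete, here is a minimal sketch in Python of a pipeline whose stages are separate programs connected by pipes. This is not Niwt's actual interface or code: `run_pipeline`, `py_stage`, and the toy stages are hypothetical names invented for illustration, and the stages are tiny Python one-liners standing in for real tools.

```python
import subprocess
import sys


def py_stage(code):
    """Build the argv for a stage that runs a Python one-liner (illustration only)."""
    return [sys.executable, "-c", code]


def run_pipeline(stages, data):
    """Run each stage as its own process, piping stdout of one into stdin of the next.

    `stages` is a non-empty list of argv lists. Assumption: `data` is small
    enough to fit in a pipe buffer, since it is written in full before any
    output is read.
    """
    procs = []
    for argv in stages:
        upstream = procs[-1].stdout if procs else subprocess.PIPE
        proc = subprocess.Popen(argv, stdin=upstream, stdout=subprocess.PIPE)
        if procs:
            procs[-1].stdout.close()  # the parent no longer needs this pipe end
        procs.append(proc)
    procs[0].stdin.write(data)
    procs[0].stdin.close()
    output = procs[-1].stdout.read()
    procs[-1].stdout.close()
    for proc in procs:
        proc.wait()
    return output


# Two toy stages; any program that reads stdin and writes stdout could replace them.
upper = py_stage("import sys; sys.stdout.write(sys.stdin.read().upper())")
reverse = py_stage("import sys; sys.stdout.write(sys.stdin.read()[::-1])")

result = run_pipeline([upper, reverse], b"abc")
```

Replacing `reverse` with a different program changes the behaviour without touching the rest of the chain, which is the kind of customization described above. The cost is one process per stage, which is where the efficiency trade-off comes from.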

In terms of functionality, Niwt has virtually nothing to offer at the moment. It can download files. It doesn’t have Wget’s automatic connection recovery (yet), nor does it have timestamping or recursion (yet). The point of this pre-pre-pre-prerelease is not to demonstrate what Niwt does, but what it could eventually do, and how it will allow you to do it. Every bit of Niwt’s operation is open and transparent to the user, and modifiable in every way.

To find out more about this project, please visit http://niwt.addictivecode.org/Niwt, and especially http://niwt.addictivecode.org/TryingOutNiwt to get an idea of how it works (though that page is best enjoyed with your copy of niwt already installed, which you can get from http://niwt.addictivecode.org/InstallingNiwt).

I’ve set up an IRC channel, #niwt @ irc.freenode.net, where I’ll try to be available when I can, and a users’ discussion mailing list at http://addictivecode.org/mailman/listinfo/niwt-users/ .

Try it out and let me know what you think!

Niwt’s source code is free and open source software, and is available under the MIT (simple BSD-style) license.

I have no idea if this will be helpful to anyone else, but I’ll throw it out there for the search engines to pick up, just in case. I wrote a Perl script that takes a Sendmail mbox archive of email messages and transcodes the text of all their bodies, along with the Subject and From headers, to UTF-8. This script may be had here. It reads the archive on standard input and writes the transcoded archive to standard output.

I’m subscribed to a few Japanese-language mailing lists (well, more accurately, one Japanese-language mailing list, a daily mail of the Slashdot Japan headlines, and Google Alerts for wget and tmux). The idea for this was for me to get Japanese practice by reading regular Japanese content on subjects I’m interested in.

The problem is, I just don’t read it when it comes in. The best time for me to practice Japanese is on the train, during my work commute. Which is why I have a Kindle 3, so I can browse Japanese websites on the train. So, I need these mails web-accessible. No problem, I can use a mail web archive tool, like hypermail.

But hypermail doesn’t like dealing with mbox files whose messages are in various incompatible encodings; some of my mail arrives in UTF-8 (Unicode), and some in ISO-2022-JP (a popular encoding for Japanese-language text). Hypermail also doesn’t deal with encoded characters in the mail headers, and I didn’t know which character encoding to have Apache announce to the browser, because it differed from one mail to the next.

So I wrote this transcoder tool in Perl. It just scans through all the mails, decodes the Subject and From headers (it currently leaves the others alone), and transcodes all ISO-2022-JP text (or anything else that’s not UTF-8) to UTF-8, so all the messages use the same encoding, and I just configure Apache to use UTF-8 for all of them. The best part is that, now that the actual content and the server-specified character encoding agree consistently, I can use online tools like Hiragana Megane to process this web mail archive, and automatically provide pronunciations for words I’m not very familiar with.
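For anyone who wants to see the shape of the per-message work, here is a rough sketch of the same idea using only Python's standard library. It is not the Perl script, and it makes some assumptions of its own: the function names are made up for illustration, a part with no declared charset is treated as UTF-8, the output is a simplified plain rendering (decoded From and Subject, a blank line, then the body) rather than a re-serialized mbox entry, and splitting the mbox into messages is left out entirely.

```python
import base64
from email import message_from_bytes
from email.header import decode_header, make_header


def decode_header_value(raw_value):
    """Turn RFC 2047 encoded-words (=?charset?B?...?=) into a Unicode string."""
    return str(make_header(decode_header(raw_value)))


def body_text(msg):
    """Return the text parts of a message as one Unicode string."""
    chunks = []
    for part in msg.walk():
        if part.get_content_maintype() != "text":
            continue  # skip multipart containers and non-text attachments
        payload = part.get_payload(decode=True)  # undoes base64 / quoted-printable
        if payload is None:
            continue
        # Assumption: a part with no declared charset is treated as UTF-8.
        charset = part.get_content_charset() or "utf-8"
        try:
            chunks.append(payload.decode(charset, errors="replace"))
        except LookupError:  # charset name that Python does not recognize
            chunks.append(payload.decode("utf-8", errors="replace"))
    return "".join(chunks)


def transcode_message(raw_message):
    """Render one message (bytes) as UTF-8 bytes: decoded headers, then body."""
    msg = message_from_bytes(raw_message)
    lines = []
    for name in ("From", "Subject"):  # other headers are left out of this sketch
        if msg[name] is not None:
            lines.append(f"{name}: {decode_header_value(msg[name])}")
    return ("\n".join(lines) + "\n\n" + body_text(msg)).encode("utf-8")


# A made-up sample message whose subject and body are both ISO-2022-JP.
sample = (
    b"From: sender@example.com\n"
    b"Subject: =?ISO-2022-JP?B?"
    + base64.b64encode("日本語".encode("iso2022_jp"))
    + b"?=\n"
    b"Content-Type: text/plain; charset=ISO-2022-JP\n"
    b"\n"
) + "こんにちは\n".encode("iso2022_jp")

utf8_bytes = transcode_message(sample)
```

Decoding with `errors="replace"` means a mislabeled message produces replacement characters instead of aborting the whole run, which seemed like the safer default for a large archive.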

The script requires the following Perl modules: MIME::Tools (for parsing email format), Mail::Mbox::MessageParser (for parsing mbox), and Text::Iconv (to handle the transcoding).


The random ramblings of Micah Cowan. Programmer, musician, typesetting enthusiast, gamer…


Announcement: Niwt (Nifty Integrated Web Tools)

Japanese Mail Archive Character-Set Transcoder