@kyanny's blog

Print created markets for mankind, and also taught it how to raise national armies. - Marshall McLuhan, "The Gutenberg Galaxy"

Today's PerlMonks

Damn... I thought this might be my chance to answer a question, but while I was still looking into it, someone else posted the best answer.


Basic problem: perl 5.8 seems to refuse to decode Korean UTF-8 correctly.

I have an e-mail sending program that reads UTF-8 Korean (and Japanese) from a database and then formats it to an e-mail. I already have this routine working well for iso-8859.

I thought all I would have to do is change the MIME tags to utf-8 and have it print out the raw utf-8 characters, but perl 5.8.* is complaining I have a Wide character as part of a function call when I call encode_qp (for converting the subject line to quoted printed format according to RFC2047 standards).. The program then dies. I tried to follow the recommendations of 'man perlunicode' and converted the database strings to utf-8 flagged status using:

$subjecttxt = Encode::decode_utf8($subjecttxt);
$encodedsubject = encode_qp($subjecttxt);

This resulted in a blank string.. When I changed it to use:
encode("utf8",$subjecttxt,Encode::FB_CROAK)
and it told me it couldn't convert the utf8.. thinking it was invalid.. I verified it was valid and was even able to view it correctly in Linux (with LANG=en_US.utf-8 setting).

I also went to extra step of verifying the first 3 bytes of the subject line was a valid code.. The UTF-8 sequence was "EC A0 9C" which converts to C81C in Unicode, which is a valid codepoint.

I read further into a 'README.perl' in the lib/perl5/5.8.*/unicore area that mentioned downloading a couple of large files (Unihan.txt and NormalizeTesting.txt), which I did, and followed the one step of 'perl mktables -makelist'... This build process seemed to work but it still complains about the invalid translations..

Is there more that I need to do to get a successful utf8 decode?
Is there a workaround way I could pass the raw utf8 directly to encode_qp() function without it complaining?

Thanks much in advance.

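This is looser than even a free translation, but the gist is: there are UTF-8 strings in a DB, and the poster cannot MIME-encode them. Since the post mentioned "Wide character", I guessed that the UTF-8 flag Perl 5.8 uses internally had not been turned off, so I was researching in that direction and writing test scripts. Then a comment appeared pointing out that the MIME::QuotedPrint documentation contains an example that fits this case exactly, and I deflated.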
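For reference, a test script along those lines might look roughly like the sketch below. It is not the script I actually wrote; it assumes the DB hands back raw UTF-8 bytes and uses the "EC A0 9C" sequence from the question. It checks the UTF-8 flag with utf8::is_utf8 before and after decoding, and records whether encode_qp dies on the decoded string.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);
use MIME::QuotedPrint qw(encode_qp);

# Assumption: the database returns raw UTF-8 bytes. "EC A0 9C" is the
# sequence quoted in the question (U+C81C).
my $bytes = "\xEC\xA0\x9C";

# decode_utf8 turns the three bytes into one character and sets the UTF-8 flag.
my $chars = decode_utf8($bytes);

printf "bytes: flag=%d length=%d\n", (utf8::is_utf8($bytes) ? 1 : 0), length $bytes;
printf "chars: flag=%d length=%d\n", (utf8::is_utf8($chars) ? 1 : 0), length $chars;

# encode_qp works on bytes; a character above 0xFF cannot be treated as one,
# so the call is wrapped in eval to catch the failure instead of dying.
my $died = !eval { encode_qp($chars); 1 };
print "encode_qp on the decoded string died: ", ($died ? "yes" : "no"), "\n";
```

If this behaves as expected, the decoded string is the one that trips encode_qp, which suggests that decoding was a step in the wrong direction for this particular call.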

The documentation for MIME::QuotedPrint talks about doing this:

Perl v5.6 and better allow extended Unicode characters in strings. Such strings cannot be encoded directly, as the quoted-printable encoding is only defined for single-byte characters. The solution is to use the Encode module to select the byte encoding you want. For example:

use MIME::QuotedPrint qw(encode_qp);
use Encode qw(encode);

$encoded = encode_qp(encode("UTF-8", "\x{FFFF}\n"));
print $encoded;

This seems different to the encode statement you've posted above.

Did you try the example from the docs? What was the outcome of that? Assuming it works, which it does for me, it would seem straightforward enough to apply that example to your needs.

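Hmm. They found exactly the right piece of information so easily and quickly that I cannot even use the handicap of English as an excuse. I still have a long way to go.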
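As a note to self, applying the documented example to a subject string like the one in the question could look something like this. It is only a sketch: qp_subject is a hypothetical helper name, and the input is assumed to be a decoded character string. Also, plain quoted-printable is not quite the RFC 2047 "Q" encoding (spaces, "?" and "_" need extra handling there), so this shows only the "encode to bytes first" step.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8 encode);
use MIME::QuotedPrint qw(encode_qp);

# Hypothetical helper (not from the thread): takes a character string and
# returns its quoted-printable form via UTF-8 bytes.
sub qp_subject {
    my ($chars) = @_;
    my $bytes = encode("UTF-8", $chars);   # character string -> UTF-8 bytes
    return encode_qp($bytes, "");          # empty $eol: no soft line breaks
}

# U+C81C, the code point from the question, as a decoded character string.
my $subject = decode_utf8("\xEC\xA0\x9C");
my $encoded = qp_subject($subject);
print "$encoded\n";
```

Passing an empty string as the second argument to encode_qp keeps it from inserting soft line breaks, which seems preferable for a short header value.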