Japanese Character Code Conversion with DeleGate

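For DeleGate 9.4.2, I revisited the Japanese character code conversion feature, which had been somewhat neglected for about ten years, and made the following extensions.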

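The documentation on Japanese code conversion has also grown quite old: parts of it no longer match the current behavior, and parts of it have become scattered. I decided to take this opportunity to pull it together.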

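Character code conversion toward the server is applied to Japanese text that is encoded as %XX%XX or as &#nnnnn; (Unicode) in the URL of the request header and in the body of a POST request (†4). For text encoded in the %XX%XX form, nothing indicates whether it is Japanese text at all, or which Japanese code it is expressed in. Consequently, when a single string can be interpreted as, for example, both EUC-JP and Shift_JIS, the conversion result may not be what was expected (†5).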
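The ambiguity of %XX-encoded text can be illustrated with a short Python sketch. It is not DeleGate's detection logic: the function name `plausible_decodings` and the list of candidate codecs are assumptions made only for this illustration. It decodes the same percent-encoded bytes under several codecs and reports every codec that accepts them without error.

```python
from urllib.parse import unquote_to_bytes

# Candidate codecs to try; the list is an assumption made for this sketch.
CANDIDATES = ("euc_jp", "shift_jis", "utf-8")


def plausible_decodings(percent_encoded: str) -> dict:
    """Return {codec: text} for every candidate that decodes the bytes cleanly."""
    raw = unquote_to_bytes(percent_encoded)
    results = {}
    for codec in CANDIDATES:
        try:
            results[codec] = raw.decode(codec)
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this codec
    return results


if __name__ == "__main__":
    for codec, text in plausible_decodings("%C3%BD").items():
        print(codec, repr(text))
```

The two bytes C3 BD decode without error as one kanji in EUC-JP, as two half-width katakana in Shift_JIS, and as one Latin letter in UTF-8, so byte validity alone cannot settle which reading was intended.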

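As of DeleGate/9.4.2, the Japanese character sets that DeleGate supports (or is believed to support) are JIS X 0201, JIS X 0208 and JIS X 0212. The character codes are ISO-2022-JP, EUC-JP, Shift_JIS and UTF-8. The character sets that can be represented differ from one character code to another. For example, "half-width kana" are not included in ISO-2022-JP, so converting from Shift_JIS to ISO-2022-JP turns half-width characters into full-width ones. Likewise, "supplementary kanji" (JIS X 0212) encoded in EUC-JP are lost when converted to Shift_JIS. When the destination code has no representation for a character, that character is replaced (by default) with the geta mark (〓) (†6).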
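The two behaviors described here, folding half-width kana to full-width for ISO-2022-JP and substituting the geta mark for characters the destination cannot represent, can be sketched in Python. This is an illustration built on Python's own codecs, not DeleGate's implementation; the handler name "geta" and the functions `fold_halfwidth_kana` and `convert` are hypothetical, and NFKC normalization is used here as a stand-in for whatever folding table DeleGate actually applies.

```python
import codecs
import re
import unicodedata

GETA = "\u3013"  # the geta mark


def _geta_on_encode_error(err):
    """Replace each unencodable character with one geta mark."""
    if isinstance(err, UnicodeEncodeError):
        return (GETA * (err.end - err.start), err.end)
    raise err


# "geta" is a handler name chosen for this sketch, not a standard one.
codecs.register_error("geta", _geta_on_encode_error)

# Half-width punctuation and katakana block (U+FF61 to U+FF9F).
_HALFWIDTH_KANA = re.compile("[\uff61-\uff9f]+")


def fold_halfwidth_kana(text: str) -> str:
    """Turn runs of half-width kana into their full-width forms via NFKC."""
    return _HALFWIDTH_KANA.sub(
        lambda m: unicodedata.normalize("NFKC", m.group()), text
    )


def convert(text: str, codec: str) -> bytes:
    """Encode text, folding kana for ISO-2022-JP and using geta for gaps."""
    if codec == "iso2022_jp":
        text = fold_halfwidth_kana(text)
    return text.encode(codec, errors="geta")
```

With this sketch, a JIS X 0212 kanji such as U+4E02 survives `convert(..., "euc_jp")` but comes out as the geta mark from `convert(..., "shift_jis")`, and half-width katakana passed to `convert(..., "iso2022_jp")` are emitted as full-width katakana.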

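For conversion between UTF-8 (the Unicode character set) and the codes of the JIS character sets, DeleGate uses the conversion tables distributed by Unicode.Org, which it downloads via DeleGate.ORG at startup (when the CHARSET option is specified). As of DeleGate/9.4.2, the tables in use are the ones classified as OBSOLETE, and they do not cover JIS X 0213. I intend to implement support for it once it is judged to be needed. (As is often pointed out, Unicode also has the problem that conversion to and from JIS codes is not unique.)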
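The non-uniqueness of the JIS/Unicode mapping can be seen with Python's built-in codecs. The sketch below is only an illustration: `shift_jis` and `cp932` are Python's codec names (the latter for Microsoft's Shift_JIS variant), the function name `unicode_mappings` is hypothetical, and the text does not say which mapping DeleGate's tables follow. The same two bytes, the JIS X 0208 wave dash at row 1, cell 33, decode to two different Unicode code points.

```python
# Shift_JIS bytes for the JIS X 0208 character at row 1, cell 33 (wave dash).
WAVE_DASH_SJIS = b"\x81\x60"


def unicode_mappings(raw: bytes) -> dict:
    """Decode the same bytes with two Shift_JIS-family mapping tables."""
    return {codec: raw.decode(codec) for codec in ("shift_jis", "cp932")}


if __name__ == "__main__":
    for codec, text in unicode_mappings(WAVE_DASH_SJIS).items():
        print(codec, "U+%04X" % ord(text))
```

Python's `shift_jis` table yields U+301C (WAVE DASH) while `cp932` yields U+FF5E (FULLWIDTH TILDE), so a round trip through Unicode can change the character depending on which table each side uses.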

The following is an excerpt from Manual.htm, the DeleGate reference manual.


CHARCODE parameter* ==  CHARCODE=[inputCode/]outputCode[:[tosv][:connMap]]
        outputCode  ==  charCode
          charCode  ==  iso-2022-jp | euc-jp | shift_jis | utf-8 | us-ascii |
                               JIS | EUC | SJIS | UTF8 | ASCII | guess
           connMap  ==  [ProtoList][:[dstHostList][:[srcHostList]]]
                    --  restriction: applicable to HTTP, FTP, SMTP, POP,
                                     NNTP, Telnet, Tcprelay
                    --  default: none


The pseudo code name "guess" means that no conversion is done; instead, the "charset" attribute is supplemented in the "Content-Type" header of a message when it is lacking, guessed from the body of the message. This is useful when the viewer program of the message (e.g. a web browser) is not localized to Japanese and would therefore guess non-ASCII codes such as EUC-JP to be a European encoding or the like.
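What "guess" does to a header can be sketched as follows. This is a rough illustration, not DeleGate's guessing algorithm: the function names `guess_charset` and `supplement_charset`, the order in which codecs are tried, and the choice to leave the header alone when nothing matches are all assumptions.

```python
from typing import Optional


def guess_charset(body: bytes) -> Optional[str]:
    """Guess a charset label from the body; the trial order is an assumption."""
    # ISO-2022-JP is 7-bit and marked by escape sequences, so check it first.
    if b"\x1b$B" in body or b"\x1b$@" in body:
        return "ISO-2022-JP"
    if all(byte < 0x80 for byte in body):
        return "US-ASCII"
    for label, codec in (
        ("UTF-8", "utf-8"),
        ("EUC-JP", "euc_jp"),
        ("Shift_JIS", "shift_jis"),
    ):
        try:
            body.decode(codec)
        except UnicodeDecodeError:
            continue
        return label
    return None


def supplement_charset(content_type: str, body: bytes) -> str:
    """Add a charset attribute only when the header lacks one."""
    if "charset=" in content_type.lower():
        return content_type
    guessed = guess_charset(body)
    return f"{content_type}; charset={guessed}" if guessed else content_type
```

The body bytes are passed through untouched; only the header value changes, which matches the "no conversion, just supplement" behavior described above.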

If "tosv" is specified, the conversion is applied to the request message (that is, a message sent toward a server). The conversion is also applied to fragments of Japanese text in an HTTP request message that are encoded as "%XX" in the request URL or in the body of a POST message (when the body is encoded as Content-Type: application/x-www-form-urlencoded). The set of Content-Type values to which the conversion is applied can be specified with HTTPCONF=post-ccx-type.
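The effect of a "tosv" conversion on a form body can be sketched with Python's standard library. This is an illustration under stated assumptions, not DeleGate's code: the function name `transcode_form_body` is hypothetical, and the source encoding is passed in explicitly, whereas a proxy has to infer it.

```python
from urllib.parse import parse_qsl, urlencode


def transcode_form_body(body: str, src: str, dst: str) -> str:
    """Re-encode an application/x-www-form-urlencoded body from src to dst."""
    # Decode the %XX escapes as text in the source encoding.
    pairs = parse_qsl(body, keep_blank_values=True, encoding=src, errors="strict")
    # Percent-encode the same text again in the destination encoding.
    return urlencode(pairs, encoding=dst, errors="strict")


if __name__ == "__main__":
    # The value is two kanji encoded as UTF-8.
    print(transcode_form_body("q=%E6%97%A5%E6%9C%AC", "utf-8", "shift_jis"))
```

For the sample input, the UTF-8 escapes `%E6%97%A5%E6%9C%AC` become the Shift_JIS escapes `%93%FA%96%7B`. Using `errors="strict"` makes a wrong source guess fail loudly instead of silently producing replacement characters.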

The application of the conversion can be limited only to a specified set of protocols, servers and clients specified with connMap. In the connMap, the default value of ProtoList, dstHostList and srcHostList is "*" which matches any protocols or hosts.

Example: send response in UTF8 to clients and send request in Shift_JIS to HTTP servers

CHARCODE=UTF8 CHARCODE=SJIS:tosv:http
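To make the field layout of the syntax above concrete, here is a small Python sketch that splits a CHARCODE value (the part after `CHARCODE=`) into its fields and fills in the documented "*" defaults. The names `CharcodeSpec` and `parse_charcode` are hypothetical, and the sketch assumes that host lists contain no ":" before the last field; DeleGate's real parser may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CharcodeSpec:
    input_code: Optional[str]
    output_code: str
    tosv: bool
    proto_list: str
    dst_host_list: str
    src_host_list: str


def parse_charcode(value: str) -> CharcodeSpec:
    """Split [inputCode/]outputCode[:[tosv][:connMap]] into its fields."""
    fields = value.split(":", 4)
    fields += [""] * (5 - len(fields))  # pad omitted trailing fields
    codes, direction, protos, dst, src = fields
    input_code, _, output_code = codes.rpartition("/")
    if not output_code:
        raise ValueError("outputCode is required")
    if direction not in ("", "tosv"):
        raise ValueError(f"unexpected direction field: {direction!r}")
    return CharcodeSpec(
        input_code=input_code or None,
        output_code=output_code,
        tosv=(direction == "tosv"),
        proto_list=protos or "*",      # "*" matches any protocol
        dst_host_list=dst or "*",      # "*" matches any server
        src_host_list=src or "*",      # "*" matches any client
    )
```

Applied to the example, `parse_charcode("UTF8")` describes a response-side conversion for all protocols and hosts, while `parse_charcode("SJIS:tosv:http")` describes a request-side conversion limited to HTTP.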

For the FTP protocol, the conversion is applied only to the data of the ASCII type relayed on the data-connections by default. It is applied also to binary data or data on the control-connection with FTPCONF="ccx:any".

To enable this parameter for internet-mail/news protocols (SMTP, POP and NNTP), the MIMECONV parameter must also be set so that character conversion is enabled (it is enabled by default).

An HTTP client can override this specification by sending its choice in the "Accept-Language" field of a request message, which may be configurable in each client (WWW browser). For example, if "Accept-Language: (charcode=EUC)" is sent in a request from a client, the response text will be converted into EUC regardless of the CHARCODE specification of the DeleGate. If "Accept-Language: (charcode=THRU)" is specified, any conversion specified by the administrator of this DeleGate is disabled.
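The override rule can be sketched as a small Python function. It is an illustration only: the function name `effective_charcode`, the regular expression, and the case-insensitive matching are assumptions, not DeleGate's actual header parsing.

```python
import re
from typing import Optional

# Matches a "(charcode=NAME)" token; case-insensitivity is an assumption.
_CHARCODE_RE = re.compile(r"\(charcode=([A-Za-z0-9_-]+)\)", re.IGNORECASE)


def effective_charcode(
    configured: Optional[str], accept_language: Optional[str]
) -> Optional[str]:
    """Return the output code to use, or None when no conversion applies."""
    if accept_language:
        match = _CHARCODE_RE.search(accept_language)
        if match:
            choice = match.group(1)
            # THRU disables any conversion configured by the administrator.
            return None if choice.upper() == "THRU" else choice
    return configured
```

For instance, with `CHARCODE=UTF8` configured, a request carrying `Accept-Language: (charcode=EUC)` yields "EUC", one carrying `(charcode=THRU)` yields no conversion, and an ordinary language list leaves the configured value in effect.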


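■ DeleGate as a character code conversion reverse proxy

■ Reverse proxy configuration

■ Character code conversion proxy

■ Configuration of a character code conversion reverse proxy

■ Problems with character code conversion proxies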