2018-03-11

A pitfall in Java's Japanese calendar (wareki) support

Java

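It occurred to me that Java supports the Japanese calendar (wareki), so I searched, found the program on the site below, and used it as a base to check an idea.

Javaで和暦→西暦、西暦→和暦に変換する - Qiita

The idea: "if I set a Gregorian year with Calendar#set and then format the date with the Japanese calendar, I can tell which era that Gregorian year falls in."

Let's write it.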

Test1.java

import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Locale;

public class Test1 {
    public static void main(String[] args) {
        Locale locale = new Locale("ja", "JP", "JP");
        Calendar cal = Calendar.getInstance(locale);
        System.out.println(cal.getClass().getName());

        int[] years = { 1900, 1925, 1950, 1975, 2000 };
        for (int year : years) {
            cal.set(Calendar.YEAR, year);
            DateFormat format = new SimpleDateFormat("GGGGy年M月d日", locale);
            System.out.println(year + ": " + format.format(cal.getTime()));
        }
    }
}

Run it.

$ javac Test1.java && java Test1
java.util.JapaneseImperialCalendar
1900: 平成1900年3月11日
1925: 平成1925年3月11日
1950: 平成1950年3月11日
1975: 平成1975年3月11日
2000: 平成2000年3月11日
$

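Wait, what? Does cal.set(Calendar.YEAR, year) expect the year within the era? Apparently so: the calendar returned for this locale is a JapaneseImperialCalendar positioned in the current era (Heisei), so YEAR is interpreted as a year of that era.

After fiddling with the program for a while, the culprit turned out to be the line Calendar cal = Calendar.getInstance(locale).

The rewritten program: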

Test2.java

import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.Locale;

public class Test2 {
    public static void main(String[] args) {
        Locale locale = new Locale("ja", "JP", "JP");
        Calendar cal = Calendar.getInstance();
        System.out.println(cal.getClass().getName());

        int[] years = { 1900, 1925, 1950, 1975, 2000 };
        for (int year : years) {
            cal.set(Calendar.YEAR, year);
            DateFormat format = new SimpleDateFormat("GGGGy年M月d日", locale);
            System.out.println(year + ": " + format.format(cal.getTime()));
        }
    }
}

Run it.

$ javac Test2.java && java Test2
java.util.GregorianCalendar
1900: 明治33年3月11日
1925: 大正14年3月11日
1950: 昭和25年3月11日
1975: 昭和50年3月11日
2000: 平成12年3月11日
$

This time the output is as expected.
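The same era lookup can also be done while keeping the Japanese calendar instance, by copying the instant from a Gregorian calendar instead of setting YEAR. This is only a minimal sketch: the class name EraLookup and the helper toJapaneseEraYear are made up for illustration, and the era numbers are simply whatever Calendar.ERA returns for this calendar.

```java
import java.util.Calendar;
import java.util.GregorianCalendar;
import java.util.Locale;

public class EraLookup {
    /** Returns {era, yearOfEra} for a Gregorian date (month is 1-based). Hypothetical helper. */
    static int[] toJapaneseEraYear(int gregorianYear, int month, int day) {
        // Interpret the arguments as a plain Gregorian date.
        Calendar greg = new GregorianCalendar(gregorianYear, month - 1, day);
        // The ja_JP_JP locale yields the Japanese imperial calendar.
        Calendar jp = Calendar.getInstance(new Locale("ja", "JP", "JP"));
        // Copy the instant, not the field values, so YEAR is never reinterpreted.
        jp.setTimeInMillis(greg.getTimeInMillis());
        return new int[] { jp.get(Calendar.ERA), jp.get(Calendar.YEAR) };
    }

    public static void main(String[] args) {
        int[] years = { 1900, 1925, 1950, 1975, 2000 };
        for (int year : years) {
            int[] r = toJapaneseEraYear(year, 3, 11);
            System.out.println(year + ": era=" + r[0] + ", year=" + r[1]);
        }
    }
}
```

Reading ERA and YEAR back from the calendar avoids formatting and parsing a string just to learn the era.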

But that means that if I want to set the "33" of "Meiji 33" (明治33年) directly, this approach cannot do it.

So let's look at the source of java.util.JapaneseImperialCalendar, the class name printed by the first program (Test1.java). (CentOS 7, java-1.8.0-openjdk-1.8.0.151-5.b12.el7_4.x86_64)

class JapaneseImperialCalendar extends Calendar {
    /*
     * Implementation Notes
     *
     * This implementation uses
     * sun.util.calendar.LocalGregorianCalendar to perform most of the
     * calendar calculations. LocalGregorianCalendar is configurable
     * and reads <JRE_HOME>/lib/calendars.properties at the start-up.
     */

    /**
     * The ERA constant designating the era before Meiji.
     */
    public static final int BEFORE_MEIJI = 0;

    /**
     * The ERA constant designating the Meiji era.
     */
    public static final int MEIJI = 1;

    /**
     * The ERA constant designating the Taisho era.
     */
    public static final int TAISHO = 2;

    /**
     * The ERA constant designating the Showa era.
     */
    public static final int SHOWA = 3;

    /**
     * The ERA constant designating the Heisei era.
     */
    public static final int HEISEI = 4;

The class is package-private, so it cannot be referenced directly, but it defines what look like the Calendar.ERA constants for Meiji, Taisho, Showa, and Heisei. Let's try them.

Test3.java

import java.text.DateFormat;
import java.text.SimpleDateFormat;
import java.util.Calendar;
import java.util.HashMap;
import java.util.Map;
import java.util.Locale;

public class Test3 {
    public static void main(String[] args) {
        Locale locale = new Locale("ja", "JP", "JP");
        Calendar cal = Calendar.getInstance(locale);
        System.out.println(cal.getClass().getName());

        Map<Integer, Integer> years = new HashMap<>();
        years.put(1, 33); // Meiji 33
        years.put(2, 14); // Taisho 14
        years.put(3, 25); // Showa 25
        years.put(4, 12); // Heisei 12
        for (Map.Entry<Integer, Integer> entry : years.entrySet()) {
            int era = entry.getKey();
            int year = entry.getValue();
            cal.set(Calendar.ERA, era);
            cal.set(Calendar.YEAR, year);
            DateFormat format = new SimpleDateFormat("GGGGy年M月d日", locale);
            System.out.println(year + ": " + format.format(cal.getTime()));
        }
    }
}

Run it.

$ javac Test3.java && java Test3
java.util.JapaneseImperialCalendar
33: 明治33年3月11日
14: 大正14年3月11日
25: 昭和25年3月11日
12: 平成12年3月11日
$

It works, although it breaks if the values of those constants ever change in the implementation.
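For comparison, the java.time.chrono package (also available since Java 8) exposes the eras as public constants, so the magic numbers are not needed there. The following is a sketch under that assumption; the class name JapaneseDateSketch and its helper methods are made up for illustration and are not part of the original programs.

```java
import java.time.LocalDate;
import java.time.chrono.JapaneseDate;
import java.time.chrono.JapaneseEra;
import java.time.temporal.ChronoField;

public class JapaneseDateSketch {
    /** Era + year of era + month + day -> ISO (Gregorian) date. Hypothetical helper. */
    static LocalDate toIso(JapaneseEra era, int yearOfEra, int month, int day) {
        return LocalDate.from(JapaneseDate.of(era, yearOfEra, month, day));
    }

    /** ISO (Gregorian) date -> year within its Japanese era. Hypothetical helper. */
    static int yearOfEra(LocalDate iso) {
        return JapaneseDate.from(iso).get(ChronoField.YEAR_OF_ERA);
    }

    public static void main(String[] args) {
        // "Meiji 33" can be given directly through the public constant JapaneseEra.MEIJI.
        System.out.println(toIso(JapaneseEra.MEIJI, 33, 3, 11));

        // The opposite direction: which era and year does a Gregorian date fall in?
        LocalDate iso = LocalDate.of(2000, 3, 11);
        System.out.println(JapaneseDate.from(iso).getEra() + " " + yearOfEra(iso));
    }
}
```

Note that JapaneseDate only supports dates from Meiji 6 (1873) onward, which is enough for the years used in this post.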

2018-03-08

A request with a trailing dot in the server name gets an unexpected response

Java Java 7 Java 8 SSL

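Background
Preparation
Server program
Connection test (server: OpenJDK 1.7.0)
Connection test (server: OpenJDK 1.8.0)
Analysis
References

Background

I was puzzled by a difference in behavior between the following two environments.

CentOS 6.9＋OpenJDK 1.8.0＋Apache Tomcat 6.0
CentOS 5.11＋OpenJDK 1.7.0＋Apache Tomcat 5.5

The behavior in question shows up with a command like this: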

$ echo '' | openssl s_client -connect localhost:8443 -servername hhelibex.local. -showcerts

That is, when a request is sent with a trailing dot in the server name, the SSL/TLS handshake fails in one environment and succeeds in the other. When it fails, the response looks like this:

CONNECTED(00000003)
139918566537120:error:140773F2:SSL routines:SSL23_GET_SERVER_HELLO:sslv3 alert unexpected message:s23_clnt.c:769:
---
no peer certificate available
---
No client certificate CA names sent
---
SSL handshake has read 7 bytes and written 324 bytes
---
New, (NONE), Cipher is (NONE)
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
SSL-Session:
    Protocol  : TLSv1.2
    Cipher    : 0000
    Session-ID: 
    Session-ID-ctx: 
    Master-Key: 
    Key-Arg   : None
    Krb5 Principal: None
    PSK identity: None
    PSK identity hint: None
    Start Time: 1520436816
    Timeout   : 300 (sec)
    Verify return code: 0 (ok)
---

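When it succeeds, Cipher shows a proper cipher suite name as below; the request also succeeds without the trailing dot. Switching the JDK in the CentOS 6 environment to OpenJDK 1.7.0 makes it succeed as well.

(omitted)
    Cipher    : ECDHE-RSA-AES256-GCM-SHA384
(omitted)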
---
DONE

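So I assumed something had changed between Java 7 and Java 8. To pin it down I read through the Tomcat 6 source, which was a long detour, until I began to suspect that the behavior had changed at a more fundamental level. I then tried a simple server program in a separate environment (CentOS 7) to see what it would reveal.

Preparation

First, prepare two JDK environments: OpenJDK 1.7.0 and OpenJDK 1.8.0. Installation steps are omitted.

Switch between them with alternatives and observe the behavior.

For the setup, the compiler (javac) is fixed to OpenJDK 1.7.0 and the runtime (java) is switched as needed, starting with OpenJDK 1.7.0.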

$ sudo alternatives --config javac

2 プログラムがあり 'javac' を提供します。

  選択       コマンド
-----------------------------------------------
*+ 1           java-1.8.0-openjdk.x86_64 (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-5.b12.el7_4.x86_64/bin/javac)
   2           java-1.7.0-openjdk.x86_64 (/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.171-2.6.13.0.el7_4.x86_64/bin/javac)

Enter を押して現在の選択 [+] を保持するか、選択番号を入力します:2
$ sudo alternatives --config java

2 プログラムがあり 'java' を提供します。

  選択       コマンド
-----------------------------------------------
*+ 1           java-1.8.0-openjdk.x86_64 (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-5.b12.el7_4.x86_64/jre/bin/java)
   2           java-1.7.0-openjdk.x86_64 (/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.171-2.6.13.0.el7_4.x86_64/jre/bin/java)

Enter を押して現在の選択 [+] を保持するか、選択番号を入力します:2
$ javac -version
javac 1.7.0_171
$ java -version
java version "1.7.0_171"
OpenJDK Runtime Environment (rhel-2.6.13.0.el7_4-x86_64 u171-b01)
OpenJDK 64-Bit Server VM (build 24.171-b01, mixed mode)
$

Next, create a throwaway server certificate.

$ openssl genrsa -aes256 2048 -out server.key
Generating RSA private key, 2048 bit long modulus
....................................+++
......................+++
e is 65537 (0x10001)
Enter pass phrase: test
Verifying - Enter pass phrase: test
-----BEGIN RSA PRIVATE KEY-----
(omitted)
-----END RSA PRIVATE KEY-----
$ openssl rsa -in server.key -out server.key 
writing RSA key
$ openssl req -new -sha256 -key server.key -out server.csr
You are about to be asked to enter information that will be incorporated
into your certificate request.
What you are about to enter is what is called a Distinguished Name or a DN.
There are quite a few fields but you can leave some blank
For some fields there will be a default value,
If you enter '.', the field will be left blank.
-----
Country Name (2 letter code) [XX]:JP
State or Province Name (full name) []:Hokkaido
Locality Name (eg, city) [Default City]:Sapporo
Organization Name (eg, company) [Default Company Ltd]:HHeLiBeX Ltd.
Organizational Unit Name (eg, section) []:
Common Name (eg, your name or your server's hostname) []:hhelibex.local
Email Address []:

Please enter the following 'extra' attributes
to be sent with your certificate request
A challenge password []:
An optional company name []:
$ openssl x509 -in server.csr -out server.crt -req -signkey server.key -days 365
Signature ok
subject=/C=JP/ST=Hokkaido/L=Sapporo/O=HHeLiBeX Ltd./CN=hhelibex.local
Getting Private key
$ openssl pkcs12 -export -inkey server.key -in server.crt -out server.p12
Enter Export Password: changeit
Verifying - Enter Export Password: changeit
$

Server program

Write and start a simple program like the following. Since this is only a test, all parameters are hard-coded.

SSLServer.java

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileInputStream;
import java.io.InputStreamReader;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.net.ServerSocket;
import java.net.Socket;
import java.security.KeyStore;

import javax.net.ServerSocketFactory;
import javax.net.ssl.KeyManagerFactory;
import javax.net.ssl.SSLContext;

public class SSLServer {
    public static void main(String[] args) {
        try {
            String keyStoreFile = "server.p12";
            char[] keyStorePassword = "changeit".toCharArray();

            KeyStore keyStore = KeyStore.getInstance("PKCS12");
            keyStore.load(new FileInputStream(keyStoreFile), keyStorePassword);

            KeyManagerFactory kmf = KeyManagerFactory.getInstance("SunX509");
            kmf.init(keyStore, keyStorePassword);

            SSLContext sslContext = SSLContext.getInstance("TLS");
            sslContext.init(kmf.getKeyManagers() , null , null);
            ServerSocketFactory ssf = sslContext.getServerSocketFactory();
            ServerSocket serverSocket  = ssf.createServerSocket(8443);

            while (true) {
                System.out.println("--------" + System.getProperty("java.version") + "--------");
                System.out.println("Waiting for SSL connection");

                Socket socket = null;
                BufferedReader in = null;
                BufferedWriter out = null;
                try {
                    socket = serverSocket.accept();
                    if (socket == null) {
                        System.out.println("Client socket is null, ignored.");
                        continue;
                    }
                    System.out.println("Accepted.");

                    in = new BufferedReader(new InputStreamReader(socket.getInputStream()));
                    out = new BufferedWriter(new OutputStreamWriter(socket.getOutputStream()));

                    String msg = in.readLine();
                    System.out.println("Message from client: " + msg);
                } catch (Exception e) {
                    e.printStackTrace();
                } finally {
                    if (in != null) {
                        try { in.close(); } catch (IOException e) { e.printStackTrace(); }
                    }
                    if (out != null) {
                        try { out.close(); } catch (IOException e) { e.printStackTrace(); }
                    }
                    if (socket != null) {
                        socket.close();
                    }
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Connection test (server: OpenJDK 1.7.0)

Starting the server

$ sudo alternatives --config java

2 プログラムがあり 'java' を提供します。

  選択       コマンド
-----------------------------------------------
*  1           java-1.8.0-openjdk.x86_64 (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-5.b12.el7_4.x86_64/jre/bin/java)
 + 2           java-1.7.0-openjdk.x86_64 (/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.171-2.6.13.0.el7_4.x86_64/jre/bin/java)

Enter を押して現在の選択 [+] を保持するか、選択番号を入力します:2
$ java SSLServer
--------1.7.0_171--------
Waiting for SSL connection

Connecting from the client

$ echo "Hello World" | openssl s_client -connect localhost:8443 -servername hhelibex.local. -showcerts
CONNECTED(00000003)
depth=0 C = JP, ST = Hokkaido, L = Sapporo, O = HHeLiBeX Ltd., CN = hhelibex.local
verify error:num=18:self signed certificate
verify return:1
depth=0 C = JP, ST = Hokkaido, L = Sapporo, O = HHeLiBeX Ltd., CN = hhelibex.local
verify return:1
---
Certificate chain
 0 s:/C=JP/ST=Hokkaido/L=Sapporo/O=HHeLiBeX Ltd./CN=hhelibex.local
   i:/C=JP/ST=Hokkaido/L=Sapporo/O=HHeLiBeX Ltd./CN=hhelibex.local
-----BEGIN CERTIFICATE-----
MIIDQjCCAioCCQC3i2D7IVfOsTANBgkqhkiG9w0BAQsFADBjMQswCQYDVQQGEwJK
UDERMA8GA1UECAwISG9ra2FpZG8xEDAOBgNVBAcMB1NhcHBvcm8xFjAUBgNVBAoM
DUhIZUxpQmVYIEx0ZC4xFzAVBgNVBAMMDmhoZWxpYmV4LmxvY2FsMB4XDTE4MDMw
NzE2MjU1NloXDTE5MDMwNzE2MjU1NlowYzELMAkGA1UEBhMCSlAxETAPBgNVBAgM
CEhva2thaWRvMRAwDgYDVQQHDAdTYXBwb3JvMRYwFAYDVQQKDA1ISGVMaUJlWCBM
dGQuMRcwFQYDVQQDDA5oaGVsaWJleC5sb2NhbDCCASIwDQYJKoZIhvcNAQEBBQAD
ggEPADCCAQoCggEBAM2I/AxPvGX4Pu9BqwmP0XoAAXpCYIrRNNa+bw1irLHQ82o+
J1bEGev9x6/jYhA+L7AYHMGnE6gbg1azrxyc06OdzN5X6OTC9xia6S7+LP3Ar8P6
c3BURqoU7TWOZdT7/KmPfgPC/uNty0lgA1U74PsdSMDE6VPU/MVyRoezsDLiXPrV
p9eve9bXHiuRF1G+Y0lO3Ym3fGilIZa7HEEqeVLTrKWmS2odOlIT/t8VAfCKBds9
nd38hwPThB0k9F6fhkUgwDeEZZdXXtMh+UNKEHdkI/VGji6uL72sAnwTc1H2VoAc
srKcdmzAWh8Kj9ynVenoAjWUTD4LV3L/JF76PLcCAwEAATANBgkqhkiG9w0BAQsF
AAOCAQEAgpx32ml5YAacemf63fk9fNS7czjUZqvHBtfR4B6Vl+nmwHnIOazjSgRz
WRv86ZnP9t2bh5myhFZtg47BLI6gW8Ca4a92lehz1Cl6tB5sqbBk0vm3Jd78SzIV
T1Kxx9CgOEEgT54WuLyRpuaQwICQrKZgWWysCojRtiK/7tZ0amCW8CMfakLdvAaW
NF3814ZPS5AiIWrxaQ4XVEw2kB8yGapHO9dLMO81Jr9xpzjG2ENMR7aAxzPcdFvE
HJmIAyNYN8e6J/KcYAlAGz3Jo+ppcqn3De2GlFWwvPKiuVrDTeuM9r68u+rRFrmA
wTxbkeP/rGxXRzG3G2VQmgY7rtBI6A==
-----END CERTIFICATE-----
---
Server certificate
subject=/C=JP/ST=Hokkaido/L=Sapporo/O=HHeLiBeX Ltd./CN=hhelibex.local
issuer=/C=JP/ST=Hokkaido/L=Sapporo/O=HHeLiBeX Ltd./CN=hhelibex.local
---
No client certificate CA names sent
Peer signing digest: SHA512
Server Temp Key: ECDH, P-256, 256 bits
---
SSL handshake has read 1378 bytes and written 495 bytes
---
New, TLSv1/SSLv3, Cipher is ECDHE-RSA-AES256-SHA384
Server public key is 2048 bit
Secure Renegotiation IS supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
SSL-Session:
    Protocol  : TLSv1.2
    Cipher    : ECDHE-RSA-AES256-SHA384
    Session-ID: 5AA01ABEC478A8640AF9BFA21DB39BD751CC7BD651F29EFCEA4A14F91B2D7FC8
    Session-ID-ctx: 
    Master-Key: 47C8AF2C33540069495BD8372059306B8411D19F684007DFF8E88BEE5653690CCE5C010145C58D54962A14C29418F89D
    Key-Arg   : None
    Krb5 Principal: None
    PSK identity: None
    PSK identity hint: None
    Start Time: 1520442046
    Timeout   : 300 (sec)
    Verify return code: 18 (self signed certificate)
---
DONE
$

Server-side output

The output continues from startup.

Accepted.
Message from client: Hello World
--------1.7.0_171--------
Waiting for SSL connection
^C
$

The server certificate was retrieved without problems, and the client's message "Hello World" arrived as well.

Connection test (server: OpenJDK 1.8.0)

Starting the server

$ sudo alternatives --config java

2 プログラムがあり 'java' を提供します。

  選択       コマンド
-----------------------------------------------
*  1           java-1.8.0-openjdk.x86_64 (/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.151-5.b12.el7_4.x86_64/jre/bin/java)
 + 2           java-1.7.0-openjdk.x86_64 (/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.171-2.6.13.0.el7_4.x86_64/jre/bin/java)

Enter を押して現在の選択 [+] を保持するか、選択番号を入力します:1
$ java SSLServer
--------1.8.0_151--------
Waiting for SSL connection

Connecting from the client

$ echo "Hello World" | openssl s_client -connect localhost:8443 -servername hhelibex.local. -showcerts
CONNECTED(00000003)
140566511749024:error:140773F2:SSL routines:SSL23_GET_SERVER_HELLO:sslv3 alert unexpected message:s23_clnt.c:769:
---
no peer certificate available
---
No client certificate CA names sent
---
SSL handshake has read 7 bytes and written 313 bytes
---
New, (NONE), Cipher is (NONE)
Secure Renegotiation IS NOT supported
Compression: NONE
Expansion: NONE
No ALPN negotiated
SSL-Session:
    Protocol  : TLSv1.2
    Cipher    : 0000
    Session-ID: 
    Session-ID-ctx: 
    Master-Key: 
    Key-Arg   : None
    Krb5 Principal: None
    PSK identity: None
    PSK identity hint: None
    Start Time: 1520441888
    Timeout   : 300 (sec)
    Verify return code: 0 (ok)
---
$

Server-side output

The output continues from startup.

Accepted.
javax.net.ssl.SSLProtocolException: Illegal server name, type=host_name(0), name=hhelibex.local., value=68:68:65:6c:69:62:65:78:2e:6c:6f:63:61:6c:2e
        at sun.security.ssl.ServerNameExtension.<init>(ServerNameExtension.java:143)
        at sun.security.ssl.HelloExtensions.<init>(HelloExtensions.java:78)
        at sun.security.ssl.HandshakeMessage$ClientHello.<init>(HandshakeMessage.java:245)
        at sun.security.ssl.ServerHandshaker.processMessage(ServerHandshaker.java:220)
        at sun.security.ssl.Handshaker.processLoop(Handshaker.java:1026)
        at sun.security.ssl.Handshaker.process_record(Handshaker.java:961)
        at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1072)
        at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1385)
        at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:938)
        at sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
        at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284)
        at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326)
        at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178)
        at java.io.InputStreamReader.read(InputStreamReader.java:184)
        at java.io.BufferedReader.fill(BufferedReader.java:161)
        at java.io.BufferedReader.readLine(BufferedReader.java:324)
        at java.io.BufferedReader.readLine(BufferedReader.java:389)
        at SSLServer.main(SSLServer.java:50)
Caused by: java.lang.IllegalArgumentException: Server name value of host_name cannot have the trailing dot
        at javax.net.ssl.SNIHostName.checkHostName(SNIHostName.java:319)
        at javax.net.ssl.SNIHostName.<init>(SNIHostName.java:183)
        at sun.security.ssl.ServerNameExtension.<init>(ServerNameExtension.java:137)
        ... 17 more
--------1.8.0_151--------
Waiting for SSL connection
^C
$

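This time an exception is thrown and the connection is dropped. Naturally, the client cannot retrieve the certificate.

Analysis

The exception output from the failing OpenJDK 1.8.0 run contains the following messages.

"Illegal server name, ..."
"Server name value of host_name cannot have the trailing dot"

So let's look at javax.net.ssl.SNIHostName, which appears in the stack trace.

SNIHostName (Java Platform SE 8)

The documentation says:

As described in section 3, "Server Name Indication", of TLS Extensions (RFC 6066), "HostName" contains the fully qualified DNS hostname of the server, as understood by the client. The encoded server name value of a hostname is represented as a byte string using ASCII encoding without a trailing dot.

And also:

Since:
1.8

"Without a trailing dot"!

I see: this part of RFC 6066 is enforced through SNIHostName, a class introduced in Java 8, so a host name ending with a dot is no longer accepted.

That answers the question from the beginning. The environment where it succeeds runs Java 7, which still let a host name ending with a dot through, while the environment where it fails runs Java 8, whose RFC 6066 handling rejects it.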

Requests normally do not use a host name ending with a dot, so this probably rarely causes trouble, but it is a trap nonetheless. Anyway, I'm glad the mystery is solved.
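The check can be reproduced without any network connection by constructing SNIHostName directly, since the stack trace above shows the exception coming from its constructor. A minimal sketch follows; the class name SniTrailingDot and both helper methods are made up for illustration, and stripTrailingDot is just one possible client-side workaround, not something the JDK provides.

```java
import javax.net.ssl.SNIHostName;

public class SniTrailingDot {
    /** Returns true if SNIHostName accepts the given host name. Hypothetical helper. */
    static boolean isAcceptedBySniHostName(String hostname) {
        try {
            new SNIHostName(hostname);
            return true;
        } catch (IllegalArgumentException e) {
            // Thrown, for example, when the name ends with a dot.
            return false;
        }
    }

    /** Hypothetical client-side normalization: drop a single trailing dot. */
    static String stripTrailingDot(String hostname) {
        return hostname.endsWith(".")
                ? hostname.substring(0, hostname.length() - 1)
                : hostname;
    }

    public static void main(String[] args) {
        String[] names = { "hhelibex.local", "hhelibex.local." };
        for (String name : names) {
            System.out.println(name + " -> accepted: " + isAcceptedBySniHostName(name)
                    + ", normalized: " + stripTrailingDot(name));
        }
    }
}
```

Normalizing the name before it is put into the SNI extension is something only the client can do; a Java 8 server of the kind shown above rejects the ClientHello before application code sees it.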

References

JavaのSSLSocketでSSLクライアントとSSLサーバーを実装する：CodeZine（コードジン）
- The source code can only be downloaded after registering, but since I had never properly written a program using SSLServerSocket, it was a great help
RFC 6066
- The original text of RFC 6066

2017-12-13

Mind the internal character encoding when calling mb_encode_mimeheader/mb_decode_mimeheader

PHP

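It is all in the manual if you read it carefully.

For the former, mb_encode_mimeheader(), the manual says:

Parameters

str
  The string being encoded. Its encoding should be same as mb_internal_encoding().

In other words, whenever multibyte characters are involved, the internal character encoding must be set with mb_internal_encoding().

Let's also check the manual for the latter, mb_decode_mimeheader().

Return Values

The decoded string in internal character encoding.

It explicitly says "internal character encoding".

To try this out, I wrote programs following the scenarios below.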

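Common assumptions
1. The system normally works with the internal character encoding set to "UTF-8".
2. Mail is sent in "ISO-2022-JP".
Scenario 1
1. Generate the string converted to ISO-2022-JP for sending mail.
2. Call mb_encode_mimeheader() without changing the internal character encoding.
3. Output the encoded string (simulated mail sending).
4. Decode the received (obtained) string with mb_decode_mimeheader().
Scenario 2
1. Generate the string converted to ISO-2022-JP for sending mail.
2. Change the internal character encoding to "ISO-2022-JP".
3. Call mb_encode_mimeheader().
4. Restore the internal character encoding to "UTF-8".
5. Output the encoded string (simulated mail sending).
6. Change the internal character encoding to "ISO-2022-JP".
7. Decode the received (obtained) string with mb_decode_mimeheader().
8. Restore the internal character encoding to "UTF-8".
Scenario 3
1. Generate the string converted to ISO-2022-JP for sending mail.
2. Change the internal character encoding to "ISO-2022-JP".
3. Call mb_encode_mimeheader().
4. Restore the internal character encoding to "UTF-8".
5. Output the encoded string (simulated mail sending).
6. Leave the internal character encoding unchanged (still UTF-8).
7. Decode the received (obtained) string with mb_decode_mimeheader().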

Common components

od.php

<?php

/*
 * Prints a dump modeled on the output of the "od -c" command.
 */
function od($str) {
    for ($i = 0; $i < strlen($str); ++$i) {
        if ($i % 16 === 0) {
            if ($i > 0) {
                echo PHP_EOL;
            }
            printf("%08o", $i);
        }
        $s = substr($str, $i, 1);
        if (ctype_graph($s) || $s === " ") {
            printf(" %3s", $s);
        } else {
            printf(" %03o", ord($s));
        }
    }
    // Terminate the last data row before printing the final offset.
    if ($i > 0) {
        echo PHP_EOL;
    }
    printf("%08o\n", $i);
}

init.php

<?php

// ** This file must of course be saved as "UTF-8" **

// Initial setup
mb_internal_encoding("UTF-8");

// Subject string to send by mail
$str = "あいうえおかきくけこさしすせそたちつてとなにぬねの";

Scenario 1

The source code is as follows.

<?php

include("od.php");
include("init.php");

// ** This file must of course be saved as "UTF-8" **

// Convert the character encoding of the string
$convStr = mb_convert_encoding($str, "ISO-2022-JP", "UTF-8");

// Encode
$encStr = mb_encode_mimeheader($convStr, "ISO-2022-JP");
echo "// エンコードされた文字列" . PHP_EOL;
var_dump($encStr);

echo PHP_EOL;

// Decode
$decStr = mb_decode_mimeheader($encStr);
echo "// デコードされた文字列" . PHP_EOL;
var_dump($decStr);
od($decStr);

Result.

// エンコードされた文字列
string(115) "=?ISO-2022-JP?B?GyRCJCIkJCQmJCgkKiQrJC0kLyQxJDMkNSQ3JDkkOyQ9JD8kQSREJEYk?=
 =?ISO-2022-JP?B?SCRKJEskTCRNJE4bKEI=?="

// デコードされた文字列
string(68) "あいうえおかきくけこさしすせそたちつてH$J$K$L$M$N"
00000000 343 201 202 343 201 204 343 201 206 343 201 210 343 201 212 343
00000020 201 213 343 201 215 343 201 217 343 201 221 343 201 223 343 201
00000040 225 343 201 227 343 201 231 343 201 233 343 201 235 343 201 237
00000060 343 201 241 343 201 244 343 201 246   H   $   J   $   K   $   L
00000100   $   M   $   N
00000104

Huh, the text is garbled.

Presumably mb_decode_mimeheader() does something like the following when decoding:

<?php
mb_internal_encoding("UTF-8");
var_dump(
    mb_convert_encoding(base64_decode("GyRCJCIkJCQmJCgkKiQrJC0kLyQxJDMkNSQ3JDkkOyQ9JD8kQSREJEYk"), mb_internal_encoding(), "ISO-2022-JP")
    . mb_convert_encoding(base64_decode("SCRKJEskTCRNJE4bKEI="), mb_internal_encoding(), "ISO-2022-JP"));

string(68) "あいうえおかきくけこさしすせそたちつてH$J$K$L$M$N"

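Same result. Each encoded word is decoded on its own, and the second one begins in the middle of a two-byte character without the leading escape sequence (ESC $ B), so its bytes come out as ASCII.

So this approach does not work.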
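The effect is not specific to PHP: any decoder that handles the two Base64 chunks from Scenario 1 one at a time sees the same thing. Below is a minimal sketch in Java of that reasoning. It assumes the JDK in use ships the ISO-2022-JP charset; the class name EncodedWordChunks and its methods are made up for illustration, and this is not how mb_decode_mimeheader() is actually implemented.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.util.Base64;

public class EncodedWordChunks {
    // Assumption: the ISO-2022-JP charset is available in this JDK.
    static final Charset JIS = Charset.forName("ISO-2022-JP");

    // The two Base64 payloads produced in Scenario 1.
    static final String CHUNK1 = "GyRCJCIkJCQmJCgkKiQrJC0kLyQxJDMkNSQ3JDkkOyQ9JD8kQSREJEYk";
    static final String CHUNK2 = "SCRKJEskTCRNJE4bKEI=";

    /** Decodes one chunk on its own, as a per-encoded-word decoder would. */
    static String decodeSeparately(String base64) {
        return new String(Base64.getDecoder().decode(base64), JIS);
    }

    /** Joins the raw bytes of all chunks first and decodes them as one stream. */
    static String decodeJoined(String... base64Chunks) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        for (String chunk : base64Chunks) {
            byte[] bytes = Base64.getDecoder().decode(chunk);
            buf.write(bytes, 0, bytes.length);
        }
        return new String(buf.toByteArray(), JIS);
    }

    public static void main(String[] args) {
        // The second chunk has no ESC $ B in front of it, so it is read as ASCII.
        System.out.println("chunk 2 alone: " + decodeSeparately(CHUNK2));
        // Joined, the stray byte at the end of chunk 1 pairs up with the "H" of chunk 2.
        System.out.println("joined length: " + decodeJoined(CHUNK1, CHUNK2).length());
    }
}
```

A well-formed header avoids the problem on the encoding side: each encoded word should contain whole characters and its own escape sequences, which is what the Scenario 2 and 3 output shows.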

Scenario 2

The source code is as follows.

<?php

include("od.php");
include("init.php");

// ** This file must of course be saved as "UTF-8" **

// Convert the character encoding of the string
$convStr = mb_convert_encoding($str, "ISO-2022-JP", "UTF-8");

// Change the internal character encoding
$origInternalEncoding = mb_internal_encoding();
mb_internal_encoding("ISO-2022-JP");

// Encode
$encStr = mb_encode_mimeheader($convStr, "ISO-2022-JP");
echo "// エンコードされた文字列" . PHP_EOL;
var_dump($encStr);

// Restore the internal character encoding
mb_internal_encoding($origInternalEncoding);

echo PHP_EOL;

// Change the internal character encoding
mb_internal_encoding("ISO-2022-JP");

// Decode
$decStr = mb_decode_mimeheader($encStr);
echo "// デコードされた文字列" . PHP_EOL;
var_dump($decStr);
od($decStr);

// Restore the internal character encoding
mb_internal_encoding($origInternalEncoding);

Result.

// エンコードされた文字列
string(123) "=?ISO-2022-JP?B?GyRCJCIkJCQmJCgkKiQrJC0kLyQxJDMkNSQ3JDkkOyQ9JD8kQSREGyhC?=
 =?ISO-2022-JP?B?GyRCJEYkSCRKJEskTCRNJE4bKEI=?="

// デコードされた文字列
string(56) "あいうえおかきくけこさしすせそたちつてとなにぬねの"
00000000 033   $   B   $   "   $   $   $   &   $   (   $   *   $   +   $
00000020   -   $   /   $   1   $   3   $   5   $   7   $   9   $   ;   $
00000040   =   $   ?   $   A   $   D   $   F   $   H   $   J   $   K   $
00000060   L   $   M   $   N 033   (   B
00000070

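At first glance it looks fine, but the dump shows that the result is in "ISO-2022-JP". The system's internal character encoding was supposed to be "UTF-8".

This does not work either.

Scenario 3

The source code is as follows.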

<?php

include("od.php");
include("init.php");

// ** This file must of course be saved as "UTF-8" **

// Convert the character encoding of the string
$convStr = mb_convert_encoding($str, "ISO-2022-JP", "UTF-8");

// Change the internal character encoding
$origInternalEncoding = mb_internal_encoding();
mb_internal_encoding("ISO-2022-JP");

// Encode
$encStr = mb_encode_mimeheader($convStr, "ISO-2022-JP");
echo "// エンコードされた文字列" . PHP_EOL;
var_dump($encStr);

// Restore the internal character encoding
mb_internal_encoding($origInternalEncoding);

echo PHP_EOL;

// Decode
$decStr = mb_decode_mimeheader($encStr);
echo "// デコードされた文字列" . PHP_EOL;
var_dump($decStr);
od($decStr);

Result.

// エンコードされた文字列
string(123) "=?ISO-2022-JP?B?GyRCJCIkJCQmJCgkKiQrJC0kLyQxJDMkNSQ3JDkkOyQ9JD8kQSREGyhC?=
 =?ISO-2022-JP?B?GyRCJEYkSCRKJEskTCRNJE4bKEI=?="

// デコードされた文字列
string(75) "あいうえおかきくけこさしすせそたちつてとなにぬねの"
00000000 343 201 202 343 201 204 343 201 206 343 201 210 343 201 212 343
00000020 201 213 343 201 215 343 201 217 343 201 221 343 201 223 343 201
00000040 225 343 201 227 343 201 231 343 201 233 343 201 235 343 201 237
00000060 343 201 241 343 201 244 343 201 246 343 201 250 343 201 252 343
00000100 201 253 343 201 254 343 201 255 343 201 256
00000113

This looks right. Nothing is garbled and the dump shows "UTF-8". To double-check, I ran the following (the console's character encoding is UTF-8).

$ echo -n 'あいうえおかきくけこさしすせそたちつてとなにぬねの' | od -c
0000000 343 201 202 343 201 204 343 201 206 343 201 210 343 201 212 343
0000020 201 213 343 201 215 343 201 217 343 201 221 343 201 223 343 201
0000040 225 343 201 227 343 201 231 343 201 233 343 201 235 343 201 237
0000060 343 201 241 343 201 244 343 201 246 343 201 250 343 201 252 343
0000100 201 253 343 201 254 343 201 255 343 201 256
0000113
$

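Same result.

Summary

To sum up, each function has to be called as follows.

mb_encode_mimeheader()
- Before calling it, use mb_internal_encoding() to switch the internal character encoding to the encoding of the string being passed (in this example, the encoding used for sending mail).
  - The function needs to know which encoding the string passed to it is in.
mb_decode_mimeheader()
- If mb_internal_encoding() is called at all, it must specify the encoding the system uses for handling strings (UTF-8 in this example).
  - The "ISO-2022-JP" information is already written in the MIME-encoded string, so it is not needed here. What is needed is the encoding of the converted result.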

2017-12-12

Trying the regular expressions "^", "$", "\A", and "\z" in various languages

Java C C++ PHP Python Ruby Perl Go

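徳丸浩の日記: 正規表現によるバリデーションでは ^ と $ ではなく \A と \z を使おう (Tokumaru Hiroshi's diary: for validation with regular expressions, use \A and \z instead of ^ and $) https://t.co/Lc20UYnwMT
- HHeLiBeX (@hhelibex) December 11, 2017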

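So, hoping not to get picked apart from all sides, here are my notes from writing a test program in each language I have at hand.

The input is the following string.

abc
123
*+=

Up to four patterns are tried. If every pattern matches, the output is

1234

For a pattern that does not match, "0" is printed in its place; for a matching method that does not exist in the language, "-" is printed.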

Environment

Limited to what I have at hand:

CentOS 7
- Java (openjdk version "1.8.0_151")
- C (gcc (GCC) 4.8.5)
  - compiled with -std=gnu11
- C++ (g++ (GCC) 4.8.5)
  - compiled with -std=gnu++1y
- PHP (PHP 5.4.16 (cli))
- Python 2 (Python 2.7.5)
- Python 3 (Python 3.6.3)
  - built from source
- Ruby (ruby 2.0.0p648)
- Perl (v5.16.3)
- Go (go version go1.8.3 linux/amd64)

Java

import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.io.Reader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    private static boolean matches(String str, String pattern, int flags) {
        Pattern p = Pattern.compile(pattern, flags);
        Matcher m = p.matcher(str);
        return m.matches();
    }

    public static void main(String[] args) {
        try (Reader in = new InputStreamReader(System.in);
            PrintWriter out = new PrintWriter(System.out)
        ) {
            char[] buf = new char[1024];
            int len = in.read(buf);
            String s = new String(buf, 0, len);
//          s = s.trim();

            if (matches(s, "^[0-9]+$", 0)) {
                out.print("1");
            } else {
                out.print("0");
            }

            if (matches(s, "^[0-9]+$", Pattern.MULTILINE)) {
                out.print("2");
            } else {
                out.print("0");
            }

            if (matches(s, "\\A[0-9]+\\z", 0)) {
                out.print("3");
            } else {
                out.print("0");
            }

            if (matches(s, "\\A[0-9]+\\z", Pattern.MULTILINE)) {
                out.print("4");
            } else {
                out.print("0");
            }

            out.println();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

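Output.

0000

No pattern matches. The program calls Matcher#matches(), which succeeds only when the entire input matches the pattern, so even the multiline pattern cannot match a single line inside the input.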
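To compare Java with the search-style functions used for the other languages, Matcher#find() is the closer counterpart. The following is a minimal sketch with the same input; the class name FindVsMatches and its helpers are made up for illustration. It also tries a single line followed by a newline ("123\n"), which is an extra case of my own and not one of the four patterns in this post.

```java
import java.util.regex.Pattern;

public class FindVsMatches {
    /** True only if the whole input matches (what the program above uses). */
    static boolean matchesWhole(String s, String regex, int flags) {
        return Pattern.compile(regex, flags).matcher(s).matches();
    }

    /** True if the pattern matches somewhere in the input. */
    static boolean findsAnywhere(String s, String regex, int flags) {
        return Pattern.compile(regex, flags).matcher(s).find();
    }

    public static void main(String[] args) {
        String input = "abc\n123\n*+=\n";
        String[] regexes = { "^[0-9]+$", "^[0-9]+$", "\\A[0-9]+\\z", "\\A[0-9]+\\z" };
        int[] flags = { 0, Pattern.MULTILINE, 0, Pattern.MULTILINE };

        // Same four patterns as above, but with find() instead of matches().
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < regexes.length; i++) {
            sb.append(findsAnywhere(input, regexes[i], flags[i]) ? String.valueOf(i + 1) : "0");
        }
        System.out.println(sb);

        // Extra case: "$" also matches just before a final line terminator, "\z" does not.
        System.out.println(findsAnywhere("123\n", "^[0-9]+$", 0));
        System.out.println(findsAnywhere("123\n", "\\A[0-9]+\\z", 0));
    }
}
```

With find(), only the multiline "^"/"$" pattern finds the digits-only line inside the input, in line with the other languages in this post.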

C

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <regex.h>

int matches(const char* str, const char* pattern, int flags) {
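    regex_t rb;
    int rc = regcomp(&rb, pattern, flags);
    if (rc != 0) {
        /* regcomp() does not set errno, so report the error with regerror(). */
        char msg[256];
        regerror(rc, &rb, msg, sizeof(msg));
        fprintf(stderr, "%s: %s\n", pattern, msg);
        exit(1);
    }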

    regmatch_t rm;
    int res;
    if (!regexec(&rb, str, 1, &rm, 0)) {
        res = 1;
    } else {
        res = 0;
    }

    regfree(&rb);

    return res;
}

int main(int argc, char** argv) {
    char str[1024];
    memset(str, '\0', sizeof(str));

    /* Read at most sizeof(str) - 1 bytes so the buffer stays NUL-terminated. */
    fread(str, sizeof(char), sizeof(str) - 1, stdin);
//  while (str[strlen(str) - 1] == '\n' || str[strlen(str) - 1] == '\r') {
//      str[strlen(str) - 1] = '\0';
//  }

    if (matches(str, "^[0-9]+$", REG_EXTENDED)) {
        printf("1");
    } else {
        printf("0");
    }

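    // No multiline test here (POSIX has REG_NEWLINE, but it is not used in this program)
    printf("-");

    // POSIX regular expressions do not define start/end-of-string anchors such as \A and \z
    printf("-");

    // No multiline test here either
    printf("-");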

    printf("\n");

    return 0;
}

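Output.

0---

Well, that is how it goes with C: only one of the patterns is exercised here. (POSIX regcomp() does provide a REG_NEWLINE flag that makes "^" and "$" match at line boundaries, but this program does not use it.)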

C++

#include <iostream>
#include <locale>
#include <string>
#include <boost/regex.hpp>

using namespace std;

bool matches(string str, const char* pattern, boost::match_flag_type flags) {
    boost::regex re(pattern);

    boost::smatch sm;
    return boost::regex_search(str, sm, re, flags);
}

int main(int argc, char** argv) {
    istreambuf_iterator<char> it(cin);
    istreambuf_iterator<char> last;
    string str(it, last);
//  str.erase(str.find_last_not_of("\r\n") + 1);

    if (matches(str, "^[0-9]+$", boost::regex_constants::match_single_line)) {
        cout << "1";
    } else {
        cout << "0";
    }

    if (matches(str, "^[0-9]+$", boost::regex_constants::match_default)) {
        cout << "2";
    } else {
        cout << "0";
    }

    if (matches(str, "\\A[0-9]+\\z", boost::regex_constants::match_single_line)) {
        cout << "3";
    } else {
        cout << "0";
    }

    if (matches(str, "\\A[0-9]+\\z", boost::regex_constants::match_default)) {
        cout << "4";
    } else {
        cout << "0";
    }

    cout << endl;

    return EXIT_SUCCESS;
}

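Output.

0200

So this is the behavior I had heard about: in multiline mode, "^" and "$" match a substring.

As the explicit boost::regex_constants::match_single_line flag on the first pattern suggests, multiline mode appears to be the default in C++ (Boost).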

PHP

<?php

$s = file_get_contents('php://stdin');
//$s = trim($s);

if (preg_match("/^[0-9]+$/", $s)) {
    echo '1';
} else {
    echo '0';
}

if (preg_match("/^[0-9]+$/m", $s)) {
    echo '2';
} else {
    echo '0';
}

if (preg_match("/\A[0-9]+\z/", $s)) {
    echo '3';
} else {
    echo '0';
}

if (preg_match("/\A[0-9]+\z/m", $s)) {
    echo '4';
} else {
    echo '0';
}

echo PHP_EOL;

Output.

0200

Likewise, in multiline mode "^" and "$" match a substring.

Python 2 / 3

import sys
import re

s = sys.stdin.read()
#s = s.strip()

if re.search(r'^[0-9]+$', s):
    sys.stdout.write('1')
else:
    sys.stdout.write('0')

if re.search(r'^[0-9]+$', s, re.MULTILINE):
    sys.stdout.write('2')
else:
    sys.stdout.write('0')

if re.search(r'\A[0-9]+\Z', s):
    sys.stdout.write('3')
else:
    sys.stdout.write('0')

if re.search(r'\A[0-9]+\Z', s, re.MULTILINE):
    sys.stdout.write('4')
else:
    sys.stdout.write('0')

sys.stdout.write("\n")

Output.

0200

Likewise, in multiline mode "^" and "$" match a substring.

Note that in Python the end-of-string anchor is "\Z", not "\z".

Ruby

s = STDIN.read
#s.chomp!

# No single-line mode is available, so skip.
print "-"

if s.match(/^[0-9]+$/)
    print "2"
else
    print "0"
end

# No single-line mode is available, so skip.
print "-"

if s.match(/\A[0-9]+\z/)
    print "4"
else
    print "0"
end

print "\n"

Output.

-2-0

As far as I could find, Ruby only has the multiline behavior for "^" and "$", so only those cases are printed.

Indeed, "^" and "$" match a substring.

Perl

my $s;
{
    local $/ = undef;
    $s = <STDIN>;
}
#chomp($s);

if ($s =~ /^[0-9]+$/) {
    print '1';
} else {
    print '0';
}

if ($s =~ /^[0-9]+$/m) {
    print '2';
} else {
    print '0';
}

if ($s =~ /\A[0-9]+\z/) {
    print '3';
} else {
    print '0';
}

if ($s =~ /\A[0-9]+\z/m) {
    print '4';
} else {
    print '0';
}

print "\n";

Output.

0200

As in PHP, the m flag enables multiline mode, in which "^" and "$" match a substring.

Go

package main

import (
    "fmt"
    "os"
    "regexp"
    "bufio"
    "io/ioutil"
//  "strings"
)

func main() {
    stdin := bufio.NewReader(os.Stdin)
    b, _ := ioutil.ReadAll(stdin)
    s := string(b)
//  s = strings.Trim(s, "\r\n")

    {
        m := regexp.MustCompile(`^[0-9]+$`)
        if m.MatchString(s) {
            fmt.Print("1")
        } else {
            fmt.Print("0")
        }
    }

    {
        m := regexp.MustCompile(`(?m)^[0-9]+$`)
        if m.MatchString(s) {
            fmt.Print("2")
        } else {
            fmt.Print("0")
        }
    }

    {
        m := regexp.MustCompile(`\A[0-9]+\z`)
        if m.MatchString(s) {
            fmt.Print("3")
        } else {
            fmt.Print("0")
        }
    }

    {
        m := regexp.MustCompile(`(?m)\A[0-9]+\z`)
        if m.MatchString(s) {
            fmt.Print("4")
        } else {
            fmt.Print("0")
        }
    }

    fmt.Println()
}

Output.

As in PHP, adding the m flag enables multiline mode, and with it "^" and "$" match a substring of the input.

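Summary

Putting the results into a table gives roughly the following. "○" marks a case that matched, "×" a case that did not, and "－" a pattern that does not exist in that language.

(1) Single-line mode, pattern using "^" and "$"
(2) Multiline mode, pattern using "^" and "$"
(3) Single-line mode, pattern using "\A" and "\z"
(4) Multiline mode, pattern using "\A" and "\z"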

	(1)	(2)	(3)	(4)
Java	×	×	×	×
C	×	－	－	－
C++	×	○	×	×
PHP	×	○	×	×
Python 2 / 3	×	○	×	×
Ruby	－	○	－	×
Perl	×	○	×	×
Go	×	○	×	×

I tend to reach for "^" and "$" without thinking myself, so I will keep this in mind.
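The four cases can be condensed into one small helper. This is only a sketch, assuming Python 3; the function name `check` and the sample inputs are made up for illustration and are not necessarily the input used for the table.

```python
import re

def check(s):
    """Return a 4-character result in the same 1/2/3/4-or-0 format as the programs above."""
    patterns = [
        (r'^[0-9]+$', 0, '1'),                # (1) single-line, ^ and $
        (r'^[0-9]+$', re.MULTILINE, '2'),     # (2) multiline, ^ and $
        (r'\A[0-9]+\Z', 0, '3'),              # (3) single-line, \A and \Z
        (r'\A[0-9]+\Z', re.MULTILINE, '4'),   # (4) multiline, \A and \Z
    ]
    return ''.join(mark if re.search(p, s, flags) else '0'
                   for p, flags, mark in patterns)

# Hypothetical input: a line of digits sandwiched between other lines
print(check("abc\n123\ndef\n"))
# Input that consists of digits only
print(check("123"))
```

For the sandwiched input only case (2) matches, which corresponds to the "×○××" row for Python in the table; for the digits-only input all four cases match.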

2017-12-11

Extracting a substring in various languages

Java C C++ PHP Python Ruby Perl Go bash Awk

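A memo on writing a program in each language that extracts a substring from an input string.

The requirements are as follows.

A single line of text is given on standard input
- The character encoding is UTF-8
- It may contain characters that need a surrogate pair in UTF-16
- It is guaranteed to be at least 3 characters long
Extract the substring "[2, 4)" of the input (that is, the string made of the 2nd and 3rd characters)
Write the extracted string to standard output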
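Since the languages below disagree on whether the second argument is an end index or a length, here is a small sketch of how the 1-based half-open interval "[2, 4)" maps onto both conventions. It assumes Python 3, and both helper names are hypothetical.

```python
def slice_1based_half_open(s, begin, end):
    """Return the characters in the 1-based half-open interval [begin, end)."""
    return s[begin - 1:end - 1]

def substr(s, start, length):
    """Return `length` characters of s starting at 0-based index `start`."""
    return s[start:start + length]

sample = "a\U0001F600bc"  # hypothetical input containing a character outside the BMP

# Both calls describe the same two characters: the 2nd and the 3rd
by_interval = slice_1based_half_open(sample, 2, 4)
by_length = substr(sample, 1, 2)
print(by_interval == by_length)
```

The (start, length) form is what C++, PHP, Ruby, Perl and Awk use below, while Python slices and Go slices take (start, end).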

Environment

I limited the environment to what I had on hand, listed below.

CentOS 7
- Java (openjdk version "1.8.0_151")
- C (gcc (GCC) 4.8.5)
  - compiled with -std=gnu11
- C++ (g++ (GCC) 4.8.5)
  - compiled with -std=gnu++1y
- PHP (PHP 5.4.16 (cli))
- Python 2 (Python 2.7.5)
- Python 3 (Python 3.6.3)
  - built from source
- Ruby (ruby 2.0.0p648)
- Perl (v5.16.3)
- Go (go version go1.8.3 linux/amd64)
- bash (4.2.46(1)-release)
- Awk (GNU Awk 4.0.2)

Java

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;

public class Main {
    /**
     * substring that takes surrogate pairs into account (indices count code points)
     */
    private static String substring(String s, int startIndex, int endIndex) {
        StringBuilder sb = new StringBuilder();

        if (startIndex < 0) {
            throw new StringIndexOutOfBoundsException(startIndex);
        }
        int cpCount = s.codePointCount(0, s.length());
        if (cpCount < endIndex) {
            throw new StringIndexOutOfBoundsException(endIndex);
        }
        int subLen = endIndex - startIndex;
        if (subLen < 0) {
            throw new StringIndexOutOfBoundsException(subLen);
        }

        int idx = 0;
        for (int i = 0; i < s.length() && idx < endIndex; ++i) {
            char ch1 = s.charAt(i);
            if (startIndex <= idx && idx < endIndex) {
                sb.append(ch1);
            }
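            // Only consume a second char when ch1 really starts a surrogate pair;
            // a lone surrogate counts as one character and never reads past the end
            if (Character.isHighSurrogate(ch1) && i + 1 < s.length()
                    && Character.isLowSurrogate(s.charAt(i + 1))) {
                char ch2 = s.charAt(++i);
                if (startIndex <= idx && idx < endIndex) {
                    sb.append(ch2);
                }
            }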
            ++idx;
        }

        return sb.toString();
    }

    public static void main(String[] args) {
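        // Decode and encode explicitly as UTF-8 instead of relying on the platform default
        try (BufferedReader in = new BufferedReader(new InputStreamReader(System.in, "UTF-8"));
            PrintWriter out = new PrintWriter(new java.io.OutputStreamWriter(System.out, "UTF-8"))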
        ) {
            String s = in.readLine();

            out.println(substring(s, 1, 3));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Once surrogate pairs are taken into account, Java turned out to need the most work for the extraction itself, because a surrogate pair is represented by two char values.
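To make the difference concrete, the following sketch counts the same string once in code points and once in UTF-16 code units (the unit that Java's String.length() and charAt() work in). It assumes Python 3, where str is indexed by code point; the helper name and the sample string are hypothetical.

```python
def utf16_units(s):
    """Number of UTF-16 code units in s, i.e. what Java's String.length() would report."""
    return len(s.encode('utf-16-le')) // 2

# Hypothetical sample: 'a', one character outside the BMP, 'b'
text = "a\U0001F600b"

code_points = len(text)      # counted per code point
units = utf16_units(text)    # counted per UTF-16 code unit

print(code_points, units)
```

The character outside the BMP takes two code units, so an index-based substring on code units can cut a pair in half unless the code walks the string the way the Java version above does.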

C

#include <stdio.h>
#include <string.h>
#include <locale.h>
#include <wchar.h>
#include <stdlib.h>

int main(int argc, char** argv) {
    setlocale(LC_ALL, "ja_JP.UTF-8");

    char str[1024];

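    if (!fgets(str, sizeof(str), stdin)) {
        return 1;
    }
    // Strip trailing CR/LF; checking the length first avoids reading str[-1] on an empty string
    size_t len = strlen(str);
    while (len > 0 && (str[len - 1] == '\n' || str[len - 1] == '\r')) {
        str[--len] = '\0';
    }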

    wchar_t buf[1024];
    const char* p = str;
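    // The third argument is a count of wide characters, not a size in bytes
    mbsrtowcs(buf, &p, sizeof(buf) / sizeof(buf[0]), NULL);

    wchar_t wstr[3];
    memset(wstr, 0, sizeof(wstr));
    // Needless to say, the "2" is a length, not an index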
    wcsncpy(wstr, &buf[1], 2);
    fwprintf(stdout, L"%ls\n", wstr);

    return 0;
}

C++

#include <iostream>
#include <locale>
#include <string>
#include <cstdlib>

using namespace std;

int main(int argc, char** argv) {
    setlocale(LC_ALL, "ja_JP.UTF-8");
    wcout.imbue(locale("japanese"));

    wstring str;
    getline(wcin, str);

    // Note that the "2" is a length, not an index
    str = str.substr(1, 2);
    wcout << str << endl;

    return EXIT_SUCCESS;
}

PHP

<?php

$str = file_get_contents('php://stdin');
$str = preg_replace("/[\r\n]/", '', $str);

// Note that the "2" is a length, not an index
echo mb_substr($str, 1, 2, 'UTF-8') . PHP_EOL;

Python 2

import sys

s = sys.stdin.readline()
ustr = unicode(s, 'UTF-8')
ustr = ustr.replace('\n', '')
ustr = ustr.replace('\r', '')

print ustr[1:3].encode('UTF-8')

Python 3

import sys

b = sys.stdin.buffer.readline()
s = str(b, 'UTF-8')
s = s.replace('\n', '')
s = s.replace('\r', '')

print(s[1:3])

Ruby

str = STDIN.gets
str.chomp!()

# Note that the "2" is a length, not an index
print str[1, 2],"\n"

Perl

use Encode;

my $str = readline(STDIN);
chomp($str);

my $ustr = decode('UTF-8', $str);
# Note that the "2" is a length, not an index
print encode('UTF-8', substr($ustr, 1, 2)),"\n";

Go

package main

import (
    "fmt"
    "os"
    "io"
    "bufio"
)

func ReadLine(reader *bufio.Reader) (s string, err error) {
    prefix := false
    buf := make([]byte, 0)
    var line []byte
    for {
        line, prefix, err = reader.ReadLine()
        if err == io.EOF {
            return
        }
        buf = append(buf, line...)
        if prefix {
            continue
        }
        s = string(buf)
        return
    }
}

func main() {
    stdin := bufio.NewReader(os.Stdin)
    s, _ := ReadLine(stdin)

    runes := []rune(s)
    fmt.Println(string(runes[1:3]))
}

bash

#! /bin/bash

# -r keeps backslashes in the input literal
IFS= read -r s

printf '%s\n' "${s}" | sed -e 's/^.\(..\).*$/\1/g'

Awk

{
    gsub(/[\r\n]/, "");
    # Note that the third parameter "2" is a length, not an index
    print substr($0, 2, 2);
}

2017-12-10

Listing the files in a given directory in various languages

Java C C++ PHP Python Ruby Perl Go bash

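A memo on writing a program in each language that lists the files directly under a given directory.

The requirements are as follows.

One directory name is given as a command-line argument
Read the names of the files directly under the given directory
- There are at most 256 files
- Only directories and regular files exist
- Directories are excluded
- So-called hidden files (files whose names start with ".") are included
Sort the file names in lexicographic order
Write the list of file names to "result/out.txt" under the given directory
- The "result" directory is prepared in advance, so there is no need to check that it exists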

Environment

I limited the environment to what I had on hand, listed below.

CentOS 7
- Java (openjdk version "1.8.0_151")
- C (gcc (GCC) 4.8.5)
  - compiled with -std=gnu11
- C++ (g++ (GCC) 4.8.5)
  - compiled with -std=gnu++1y
- PHP (PHP 5.4.16 (cli))
- Python 2 (Python 2.7.5)
- Python 3 (Python 3.6.3)
  - built from source
- Ruby (ruby 2.0.0p648)
- Perl (v5.16.3)
- Go (go version go1.8.3 linux/amd64)
- bash (4.2.46(1)-release)

Java

import java.io.File;
import java.io.FileFilter;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class Main {
    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("Usage: java Main dirname");
            System.exit(1);
            return;
        }

        File dir = new File(args[0]);
        if (!dir.isDirectory()) {
            System.err.println(dir + ": No such directory");
            System.exit(1);
            return;
        }

        // Read the list of files from the directory
        File[] files = dir.listFiles(new FileFilter() {
            public boolean accept(File file) {
                return file.isFile();
            }
        });

        // Sort the file names in lexicographic order
        List<String> filenames = new ArrayList<>();
        for (File file : files) {
            filenames.add(file.getName());
        }
        Collections.sort(filenames);

        // Write out the list of file names
        try (PrintWriter out = new PrintWriter(new FileWriter(new File(dir + "/result/out.txt")))) {
            for (String filename : filenames) {
                out.println(filename);
            }
            out.flush();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

C

#include <stdio.h>
#include <stdlib.h>
#include <dirent.h>
#include <sys/stat.h>
#include <string.h>
#include <limits.h>

int isdir(const char* path) {
    struct stat st;
    if (stat(path, &st)) {
        return 0;
    }
    return ((st.st_mode & S_IFMT) == S_IFDIR);
}

int isfile(const char* path) {
    struct stat st;
    if (stat(path, &st)) {
        return 0;
    }
    return ((st.st_mode & S_IFMT) != S_IFDIR);
}

int cmp(const void* p1, const void* p2) {
    const char* str1 = (const char*)p1;
    const char* str2 = (const char*)p2;
    return strcmp(str1, str2);
}

int main(int argc, char** argv) {
    if (argc != 2) {
        fprintf(stderr, "Usage: %s dirname\n", argv[0]);
        exit(1);
    }

    char dir[PATH_MAX];
    strcpy(dir, argv[1]);
    if (!isdir(dir)) {
        fprintf(stderr, "%s: No such directory\n", dir);
        exit(1);
    }

    // Read the list of files from the directory
    DIR* dp = opendir(dir);
    if (!dp) {
        perror(dir);
        exit(1);
    }
    char filenames[256][PATH_MAX];
    int ct = 0;
    struct dirent* entry;
    while ((entry = readdir(dp))) {
        char tmp[PATH_MAX];
        sprintf(tmp, "%s/%s", dir, entry->d_name);
        if (isfile(tmp)) {
            strcpy(filenames[ct], entry->d_name);
            ++ct;
        }
    }
    closedir(dp);

    // Sort the file names in lexicographic order
    qsort(filenames, ct, sizeof(char) * PATH_MAX, cmp);

    // Write out the list of file names
    char outFile[PATH_MAX];
    sprintf(outFile, "%s/result/out.txt", dir);
    FILE* outFp = fopen(outFile, "w");
    if (!outFp) {
        perror(outFile);
        exit(1);
    }
    for (int i = 0; i < ct; ++i) {
        fprintf(outFp, "%s\n", filenames[i]);
    }
    fclose(outFp);

    return 0;
}

C++

#include <algorithm>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <string>
#include <vector>
#include <boost/filesystem.hpp>

using namespace std;
using namespace boost::filesystem;

int main(int argc, char** argv) {
    if (argc != 2) {
        cerr << "Usage: " << argv[0] << " dirname" << endl;
        return EXIT_FAILURE;
    }

    path dir(argv[1]);
    if (!exists(dir) || !is_directory(dir)) {
        cerr << dir << ": No such directory" << endl;
        return EXIT_FAILURE;
    }

    // Read the list of files from the directory
    vector<string> filenames;
    try {
        directory_iterator end;
        for (directory_iterator it(dir); it != end; ++it) {
            if (!is_directory(it->path())) {
                filenames.push_back(it->path().filename().string());
            }
        }
    } catch (const filesystem_error& e) {
        cerr << e.what() << endl;
    }

    // Sort the file names in lexicographic order
    sort(filenames.begin(), filenames.end());

    // Write out the list of file names
    string outFile = dir.string() + "/result/out.txt";
    ofstream outFs(outFile, ios::binary);
    if (!outFs) {
        cerr << outFile << ": Cannot open file" << endl;
        exit(1);
    }
    for (string& filename : filenames) {
        outFs << filename << endl;
    }
    outFs.close();

    return EXIT_SUCCESS;
}

I relied on Boost. Compiling requires "-lboost_filesystem -lboost_system".

PHP

<?php

if (count($argv) != 2) {
    file_put_contents("php://stderr", "Usage: {$argv[0]} dirname" . PHP_EOL);
    exit(1);
}
$dir = $argv[1];
if (!is_dir($dir)) {
    file_put_contents('php://stderr', "{$dir}: No such directory" . PHP_EOL);
    exit(1);
}

// Read the list of files from the directory
$dp = opendir($dir);
if (!$dp) {
    file_put_contents('php://stderr', "{$dir}: Cannot open directory" . PHP_EOL);
    exit(1);
}
$filenames = array();
// Compare against false explicitly so that a file named "0" does not end the loop early
while (($f = readdir($dp)) !== false) {
    if (is_file("{$dir}/{$f}")) {
        $filenames[] = $f;
    }
}
closedir($dp);

// Sort the file names in lexicographic order
sort($filenames);

// Write out the list of file names
$outFile = "{$dir}/result/out.txt";
$fp = fopen($outFile, 'w');
if (!$fp) {
    file_put_contents('php://stderr', "{$outFile}: Cannot open output file" . PHP_EOL);
    exit(1);
}
foreach ($filenames as $filename) {
    fprintf($fp, "%s\n", $filename);
}
fclose($fp);

Python 2

# -*- coding: utf-8 -*-
import sys
import os

if len(sys.argv) != 2:
        sys.stderr.write("Usage: " + sys.argv[0] + " dirname\n")
        sys.exit(1)
dirPath = sys.argv[1]
if not os.path.isdir(dirPath):
        sys.stderr.write(dirPath + ": No such directory\n")
        sys.exit(1)

# Read the list of files from the directory
filenames = []
for f in os.listdir(dirPath):
        if os.path.isfile(dirPath + '/' + f):
                filenames.append(f)

# Sort the file names in lexicographic order
filenames.sort()

# Write out the list of file names
# (O_TRUNC discards any previous contents so a shorter list leaves no stale lines)
outFile = dirPath + "/result/out.txt"
outFp = os.open(outFile, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
for f in filenames:
        os.write(outFp, f + "\n")
os.close(outFp)

Apparently glob.glob is another option, but when I tried it I found it has to be called twice to pick up hidden files, so this time I went with the straightforward approach.
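For comparison, a glob-based version might look like the sketch below. It assumes Python's standard glob module, where the "*" pattern skips names starting with a dot, so a second ".*" pattern is needed; the function name is hypothetical.

```python
import glob
import os

def list_files_with_glob(dir_path):
    """Collect regular files, including dot files, using two glob patterns."""
    visible = glob.glob(os.path.join(dir_path, '*'))    # does not match dot files
    hidden = glob.glob(os.path.join(dir_path, '.*'))    # matches only dot files
    # Keep regular files only and return bare names in sorted order
    return sorted(os.path.basename(p) for p in visible + hidden if os.path.isfile(p))
```

One caveat of this sketch: characters such as "[" in dir_path itself would be interpreted as glob syntax, which os.listdir does not have to worry about.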

Python 3

import sys
import os

if len(sys.argv) != 2:
        sys.stderr.write("Usage: " + sys.argv[0] + " dirname\n")
        sys.exit(1)
dirPath = sys.argv[1]
if not os.path.isdir(dirPath):
        sys.stderr.write(dirPath + ": No such directory\n")
        sys.exit(1)

# Read the list of files from the directory
filenames = []
for f in os.listdir(dirPath):
        if os.path.isfile(dirPath + '/' + f):
                filenames.append(f)

# Sort the file names in lexicographic order
filenames.sort()

# Write out the list of file names
# (O_TRUNC discards any previous contents so a shorter list leaves no stale lines)
outFile = dirPath + "/result/out.txt"
outFp = os.open(outFile, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
for f in filenames:
        os.write(outFp, f.encode('utf-8') + b"\n")
os.close(outFp)

Ruby

if ARGV.length != 1
    STDERR.puts('Usage: ' + __FILE__ + ' dirname')
    exit 1
end

dir = ARGV[0]
if !File.directory?(dir)
    STDERR.puts(dir + ': No such directory')
    exit 1
end

# Read the list of files from the directory
filenames = []
Dir.foreach(dir).each do |filename|
    if File.file?(dir + '/' + filename)
        filenames.push(File.basename(filename))
    end
end

# Sort the file names in lexicographic order
filenames.sort!

# Write out the list of file names
outFile = dir + '/result/out.txt'
outFp = File.open(outFile, 'wb')
for filename in filenames
    outFp.puts(filename + "\n")
end
outFp.close()

Perl

if (@ARGV != 1) {
    print(STDERR 'Usage: ' . __FILE__ . " dirname\n");
    exit(1);
}
my $dir = $ARGV[0];
if (! -d $dir) {
    print(STDERR $dir . ": No such directory\n");
    exit(1);
}

# Read the list of files from the directory
my $dp;
my $res = opendir($dp, $dir);
if (!$res) {
    print(STDERR $dir . ':' . $! . "\n");
    exit(1);
}
my @filenames = ();
my $ct = 0;
while (my $filename = readdir($dp)) {
    if (-f $dir . '/' . $filename) {
        $filenames[$ct++] = $filename;
    }
}
closedir($dp);

# Sort the file names in lexicographic order
@filenames = sort(@filenames);

# Write out the list of file names
my $outFile = $dir . '/result/out.txt';
my $outFp;
$res = open($outFp, '>', $outFile);
if (!$res) {
    print(STDERR $outFile . ':' . $! . "\n");
    exit(1);
}
for (my $i = 0; $i < @filenames; ++$i) {
    print($outFp $filenames[$i] . "\n");
}
close($outFp);

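Apparently glob() can also be used, but it looked like it would have to be called twice to pick up hidden files, so this time I went with the straightforward approach.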

Go

package main

import (
    "os"
    "fmt"
    "io/ioutil"
    "sort"
)

func main() {
    if len(os.Args) != 2 {
        fmt.Fprintln(os.Stderr, "Usage: " + os.Args[0] + " dirname")
        os.Exit(1)
    }
    dir := os.Args[1]
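    // Check the error first: statInfo is nil when the path does not exist
    statInfo, err := os.Stat(dir)
    if err != nil || !statInfo.IsDir() {
        fmt.Fprintln(os.Stderr, dir + ": No such directory")
        os.Exit(1)
    }

    // Read the list of files from the directory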
    files, err := ioutil.ReadDir(dir)
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    count := 0
    for _, file := range files {
        if (!file.IsDir()) {
            count += 1
        }
    }
    filenames := make([]string, count)
    i := 0
    for _, file := range files {
        if (!file.IsDir()) {
            filenames[i] = file.Name()
            i += 1
        }
    }

    // Sort the file names in lexicographic order
    sort.Strings(filenames)

    // Write out the list of file names
    outFile := dir + "/result/out.txt"
    outFp, err := os.Create(outFile)
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    defer outFp.Close()
    for _, filename := range filenames {
        fmt.Fprintln(outFp, filename);
    }
}

I feel "container/list" would also have been an option, but sorting it looked like a hassle, so I took the easy way out.

bash

#! /bin/bash

dir=$1
if [ -z "${dir}" ]; then
    echo "Usage: ${0} dirname"
    exit 1
fi
if [ ! -d "${dir}" ]; then
    echo "${dir}: No such directory"
    exit 1
fi

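# -p marks only directories with a trailing "/" (-F would also append "*", "@" and so on to other entries)
ls -1ap "${dir}" | grep -v '/$' | LANG=C sort > "${dir}/result/out.txt"

Without "LANG=C", the sort order is not what you would expect when the directory contains files with Japanese names.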
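As a rough illustration of what byte-order sorting means (a sketch, assuming Python 3 and that sort in the C locale compares raw bytes; the function name and sample names are made up):

```python
def sort_like_c_locale(names):
    """Sort names by their UTF-8 bytes instead of by locale collation rules."""
    return sorted(names, key=lambda name: name.encode('utf-8'))

# Hypothetical file names mixing upper case, lower case and Japanese
print(sort_like_c_locale(["b", "\u3042", "a", "B"]))
```

In byte order all upper-case ASCII letters come before lower-case ones and Japanese characters come last, which is also the order that the plain sort calls in the other languages above produce.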

2017-12-07

Maximum and minimum values of integer types in various languages

Java C C++ PHP Python Ruby Perl Go bash

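A memo written on a whim to summarize the maximum and minimum values of the integer types in various languages.

Environment

I limited the environment to what I had on hand, listed below. The 32-bit environment was put together in a hurry just for this.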

CentOS 6 (32-bit)
- Java (openjdk version "1.8.0_151")
- C (gcc (GCC) 4.4.7)
  - compiled with -std=gnu99
- C++ (g++ (GCC) 4.4.7)
  - compiled with -std=gnu++0x
- PHP (PHP 5.3.3 (cli))
- Python 2 (Python 2.6.6)
- Python 3 (Python 3.6.3)
  - built from source
- Ruby (ruby 1.8.7 (2013-06-27 patchlevel 374))
- Perl (v5.10.1)
- Go (go version go1.7.6 linux/386)
- bash (4.1.2(2)-release)
CentOS 7 (64-bit)
- Java (openjdk version "1.8.0_151")
- C (gcc (GCC) 4.8.5)
  - compiled with -std=gnu11
- C++ (g++ (GCC) 4.8.5)
  - compiled with -std=gnu++1y
- PHP (PHP 5.4.16 (cli))
- Python 2 (Python 2.7.5)
- Python 3 (Python 3.6.3)
  - built from source
- Ruby (ruby 2.0.0p648)
- Perl (v5.16.3)
- Go (go version go1.8.3 linux/amd64)
- bash (4.2.46(1)-release)

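Below, for each language, I list the source used for the check, the results (an sdiff of the output from the 32-bit and 64-bit environments), and any additional notes.

In every language the output follows roughly this shape.

Find the maximum value and print it
Add 1 to the maximum and confirm that it wraps around
Find the minimum value and print it
Subtract 1 from the minimum and confirm that it wraps around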

Java

public class Main {
    public static void main(String[] args) {
        {
            byte a;
            a = Byte.MAX_VALUE;
            System.out.println("byte:max  = " + a);
            ++a;
            System.out.println("      +1  = " + a);

            a = Byte.MIN_VALUE;
            System.out.println("byte:min  = " + a);
            --a;
            System.out.println("      -1  = " + a);
        }
        {
            short a;
            a = Short.MAX_VALUE;
            System.out.println("short:max = " + a);
            ++a;
            System.out.println("       +1 = " + a);

            a = Short.MIN_VALUE;
            System.out.println("short:min = " + a);
            --a;
            System.out.println("       -1 = " + a);
        }
        {
            int a;
            a = Integer.MAX_VALUE;
            System.out.println("int:max   = " + a);
            ++a;
            System.out.println("     +1   = " + a);

            a = Integer.MIN_VALUE;
            System.out.println("int:min   = " + a);
            --a;
            System.out.println("     -1   = " + a);
        }
        {
            long a;
            a = Long.MAX_VALUE;
            System.out.println("long:max  = " + a);
            ++a;
            System.out.println("      +1  = " + a);

            a = Long.MIN_VALUE;
            System.out.println("long:min  = " + a);
            --a;
            System.out.println("      -1  = " + a);
        }
    }
}

===32bit===                                                     ===64bit===
byte:max  = 127                                                 byte:max  = 127
      +1  = -128                                                      +1  = -128
byte:min  = -128                                                byte:min  = -128
      -1  = 127                                                       -1  = 127
short:max = 32767                                               short:max = 32767
       +1 = -32768                                                     +1 = -32768
short:min = -32768                                              short:min = -32768
       -1 = 32767                                                      -1 = 32767
int:max   = 2147483647                                          int:max   = 2147483647
     +1   = -2147483648                                              +1   = -2147483648
int:min   = -2147483648                                         int:min   = -2147483648
     -1   = 2147483647                                               -1   = 2147483647
long:max  = 9223372036854775807                                 long:max  = 9223372036854775807
      +1  = -9223372036854775808                                      +1  = -9223372036854775808
long:min  = -9223372036854775808                                long:min  = -9223372036854775808
      -1  = 9223372036854775807                                       -1  = 9223372036854775807

As expected of Java, the maximum and minimum values do not change with the environment.
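The wraparound in the Java output is ordinary two's-complement arithmetic at a fixed width. A minimal sketch of that arithmetic, assuming Python 3 (whose own integers never overflow) and a hypothetical helper name:

```python
def wrap_signed(value, bits):
    """Reduce value to a signed two's-complement integer of the given bit width."""
    mask = (1 << bits) - 1
    value &= mask                      # keep only the low `bits` bits
    if value >= 1 << (bits - 1):       # the top bit is set, so the number is negative
        value -= 1 << bits
    return value

# byte: max + 1 and min - 1
print(wrap_signed(127 + 1, 8), wrap_signed(-128 - 1, 8))
# long: max + 1
print(wrap_signed(2**63 - 1 + 1, 64))
```

Passing 8, 16, 32 and 64 as the width corresponds to Java's byte, short, int and long.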

C

#include <stdio.h>
#include <limits.h>

int main(int argc, char** argv) {
    {
        short a;
        a = SHRT_MAX;
        printf("short:max       = %d\n", a);
        ++a;
        printf("             +1 = %d\n", a);

        a = SHRT_MIN;
        printf("short:min       = %d\n", a);
        --a;
        printf("             -1 = %d\n", a);
    }

    {
        unsigned short a;
        a = USHRT_MAX;
        printf("ushort:max      = %u\n", a);
        ++a;
        printf("             +1 = %u\n", a);

        a = 0;
        printf("ushort:min      = %u\n", a);
        --a;
        printf("             -1 = %u\n", a);
    }

    {
        int a;
        a = INT_MAX;
        printf("int:max         = %d\n", a);
        ++a;
        printf("             +1 = %d\n", a);

        a = INT_MIN;
        printf("int:min         = %d\n", a);
        --a;
        printf("             -1 = %d\n", a);
    }

    {
        unsigned int a;
        a = UINT_MAX;
        printf("uint:max        = %u\n", a);
        ++a;
        printf("             +1 = %u\n", a);

        a = 0;
        printf("uint:min        = %u\n", a);
        --a;
        printf("             -1 = %u\n", a);
    }

    {
        long a;
        a = LONG_MAX;
        printf("long:max        = %ld\n", a);
        ++a;
        printf("             +1 = %ld\n", a);

        a = LONG_MIN;
        printf("long:min        = %ld\n", a);
        --a;
        printf("             -1 = %ld\n", a);
    }

    {
        unsigned long a;
        a = ULONG_MAX;
        printf("ulong:max       = %lu\n", a);
        ++a;
        printf("             +1 = %lu\n", a);

        a = 0;
        printf("ulong:min       = %lu\n", a);
        --a;
        printf("             -1 = %lu\n", a);
    }

    {
        long long a;
        a = LLONG_MAX;
        printf("long long:max   = %lld\n", a);
        ++a;
        printf("             +1 = %lld\n", a);

        a = LLONG_MIN;
        printf("long long:min   = %lld\n", a);
        --a;
        printf("             -1 = %lld\n", a);
    }

    {
        unsigned long long a;
        a = ULLONG_MAX;
        printf("ulong long:max  = %llu\n", a);
        ++a;
        printf("             +1 = %llu\n", a);

        a = 0;
        printf("ulong long:min  = %llu\n", a);
        --a;
        printf("             -1 = %llu\n", a);
    }

    return 0;
}

===32bit===                                                     ===64bit===
short:max       = 32767                                         short:max       = 32767
             +1 = -32768                                                     +1 = -32768
short:min       = -32768                                        short:min       = -32768
             -1 = 32767                                                      -1 = 32767
ushort:max      = 65535                                         ushort:max      = 65535
             +1 = 0                                                          +1 = 0
ushort:min      = 0                                             ushort:min      = 0
             -1 = 65535                                                      -1 = 65535
int:max         = 2147483647                                    int:max         = 2147483647
             +1 = -2147483648                                                +1 = -2147483648
int:min         = -2147483648                                   int:min         = -2147483648
             -1 = 2147483647                                                 -1 = 2147483647
uint:max        = 4294967295                                    uint:max        = 4294967295
             +1 = 0                                                          +1 = 0
uint:min        = 0                                             uint:min        = 0
             -1 = 4294967295                                                 -1 = 4294967295
long:max        = 2147483647                                  | long:max        = 9223372036854775807
             +1 = -2147483648                                 |              +1 = -9223372036854775808
long:min        = -2147483648                                 | long:min        = -9223372036854775808
             -1 = 2147483647                                  |              -1 = 9223372036854775807
ulong:max       = 4294967295                                  | ulong:max       = 18446744073709551615
             +1 = 0                                                          +1 = 0
ulong:min       = 0                                             ulong:min       = 0
             -1 = 4294967295                                  |              -1 = 18446744073709551615
long long:max   = 9223372036854775807                           long long:max   = 9223372036854775807
             +1 = -9223372036854775808                                       +1 = -9223372036854775808
long long:min   = -9223372036854775808                          long long:min   = -9223372036854775808
             -1 = 9223372036854775807                                        -1 = 9223372036854775807
ulong long:max  = 18446744073709551615                          ulong long:max  = 18446744073709551615
             +1 = 0                                                          +1 = 0
ulong long:min  = 0                                             ulong long:min  = 0
             -1 = 18446744073709551615                                       -1 = 18446744073709551615

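The difference shows up in long/unsigned long: in the 32-bit environment they are the same as int, and in the 64-bit environment the same as long long. Note that overflowing a signed int, long or long long is undefined behavior in C, so the wraparound shown for those types is what this gcc build happened to produce rather than something the language guarantees.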
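The platform dependence of long can also be checked without a C compiler. The sketch below assumes Python with the standard ctypes module, which exposes the C types of the platform the interpreter was built for; the helper name is hypothetical.

```python
import ctypes

def signed_range(ctype):
    """Return (min, max) of a signed ctypes integer type on this platform."""
    bits = ctypes.sizeof(ctype) * 8
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1

for name, ctype in [("int", ctypes.c_int),
                    ("long", ctypes.c_long),
                    ("long long", ctypes.c_longlong)]:
    low, high = signed_range(ctype)
    print("%-9s min = %d, max = %d" % (name, low, high))
```

On a 64-bit Linux build the long line should agree with long long, and on a 32-bit build with int, mirroring the sdiff above.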

C++

#include <iostream>
#include <limits>

using namespace std;

int main(int argc, char** argv) {
    {
        short a;
        a = numeric_limits<short>::max();
        cout << "short:max      = " << a << endl;
        ++a;
        cout << "            +1 = " << a << endl;

        a = numeric_limits<short>::min();
        cout << "short:min      = " << a << endl;
        --a;
        cout << "            -1 = " << a << endl;
    }

    {
        unsigned short a;
        a = numeric_limits<unsigned short>::max();
        cout << "ushort:max     = " << a << endl;
        ++a;
        cout << "            +1 = " << a << endl;

        a = numeric_limits<unsigned short>::min();
        cout << "ushort:min     = " << a << endl;
        --a;
        cout << "            -1 = " << a << endl;
    }

    {
        int a;
        a = numeric_limits<int>::max();
        cout << "int:max        = " << a << endl;
        ++a;
        cout << "            +1 = " << a << endl;

        a = numeric_limits<int>::min();
        cout << "int:min        = " << a << endl;
        --a;
        cout << "            -1 = " << a << endl;
    }

    {
        unsigned int a;
        a = numeric_limits<unsigned int>::max();
        cout << "uint:max       = " << a << endl;
        ++a;
        cout << "            +1 = " << a << endl;

        a = numeric_limits<unsigned int>::min();
        cout << "uint:min       = " << a << endl;
        --a;
        cout << "            -1 = " << a << endl;
    }

    {
        long a;
        a = numeric_limits<long>::max();
        cout << "long:max       = " << a << endl;
        ++a;
        cout << "            +1 = " << a << endl;

        a = numeric_limits<long>::min();
        cout << "long:min       = " << a << endl;
        --a;
        cout << "            -1 = " << a << endl;
    }

    {
        unsigned long a;
        a = numeric_limits<unsigned long>::max();
        cout << "ulong:max      = " << a << endl;
        ++a;
        cout << "            +1 = " << a << endl;

        a = numeric_limits<unsigned long>::min();
        cout << "ulong:min      = " << a << endl;
        --a;
        cout << "            -1 = " << a << endl;
    }

    {
        long long a;
        a = numeric_limits<long long>::max();
        cout << "long long:max  = " << a << endl;
        ++a;
        cout << "            +1 = " << a << endl;

        a = numeric_limits<long long>::min();
        cout << "long long:min  = " << a << endl;
        --a;
        cout << "            -1 = " << a << endl;
    }

    {
        unsigned long long a;
        a = numeric_limits<unsigned long long>::max();
        cout << "ulong long:max = " << a << endl;
        ++a;
        cout << "            +1 = " << a << endl;

        a = numeric_limits<unsigned long long>::min();
        cout << "ulong long:min = " << a << endl;
        --a;
        cout << "            -1 = " << a << endl;
    }

    return EXIT_SUCCESS;
}

===32bit===                                                     ===64bit===
short:max      = 32767                                          short:max      = 32767
            +1 = -32768                                                     +1 = -32768
short:min      = -32768                                         short:min      = -32768
            -1 = 32767                                                      -1 = 32767
ushort:max     = 65535                                          ushort:max     = 65535
            +1 = 0                                                          +1 = 0
ushort:min     = 0                                              ushort:min     = 0
            -1 = 65535                                                      -1 = 65535
int:max        = 2147483647                                     int:max        = 2147483647
            +1 = -2147483648                                                +1 = -2147483648
int:min        = -2147483648                                    int:min        = -2147483648
            -1 = 2147483647                                                 -1 = 2147483647
uint:max       = 4294967295                                     uint:max       = 4294967295
            +1 = 0                                                          +1 = 0
uint:min       = 0                                              uint:min       = 0
            -1 = 4294967295                                                 -1 = 4294967295
long:max       = 2147483647                                   | long:max       = 9223372036854775807
            +1 = -2147483648                                  |             +1 = -9223372036854775808
long:min       = -2147483648                                  | long:min       = -9223372036854775808
            -1 = 2147483647                                   |             -1 = 9223372036854775807
ulong:max      = 4294967295                                   | ulong:max      = 18446744073709551615
            +1 = 0                                                          +1 = 0
ulong:min      = 0                                              ulong:min      = 0
            -1 = 4294967295                                   |             -1 = 18446744073709551615
long long:max  = 9223372036854775807                            long long:max  = 9223372036854775807
            +1 = -9223372036854775808                                       +1 = -9223372036854775808
long long:min  = -9223372036854775808                           long long:min  = -9223372036854775808
            -1 = 9223372036854775807                                        -1 = 9223372036854775807
ulong long:max = 18446744073709551615                           ulong long:max = 18446744073709551615
            +1 = 0                                                          +1 = 0
ulong long:min = 0                                              ulong long:min = 0
            -1 = 18446744073709551615                                       -1 = 18446744073709551615

The only differences are in long / unsigned long. In the 32-bit environment they have the same range as int, and in the 64-bit environment the same range as long long.
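As a rough cross-check of the observation above, the width of the C types on the current machine can be queried from Python through the standard ctypes module. This is only a sketch: the helper name signed_limits is made up for illustration, and the result reflects the platform Python was built for (for example, 64-bit Windows keeps long at 32 bits, unlike 64-bit Linux).

```python
import ctypes


def signed_limits(ctype):
    """Return (min, max) of a signed C integer type on this platform."""
    bits = 8 * ctypes.sizeof(ctype)
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


# Print the width and range of the three types compared in the C output.
for name, ctype in [("int", ctypes.c_int),
                    ("long", ctypes.c_long),
                    ("long long", ctypes.c_longlong)]:
    lo, hi = signed_limits(ctype)
    print(f"{name:<10} {8 * ctypes.sizeof(ctype):>2} bits  {lo} .. {hi}")
```

On a platform where long is 32 bits, the long row matches int; where it is 64 bits, it matches long long.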

PHP

<?php

$a = 1;
while (($a << 1) + 1 > $a) {
    $a <<= 1;
    $a += 1;
}
echo "int:max = " . $a . PHP_EOL;
echo "        = ";
var_dump($a);
++$a;
echo "     +1 = " . $a . PHP_EOL;
echo "        = ";
var_dump($a);

$a = -1;
while (($a << 1) < $a) {
    $a <<= 1;
}
echo "int:min = " . $a . PHP_EOL;
echo "        = ";
var_dump($a);
--$a;
echo "     -1 = " . $a . PHP_EOL;
echo "        = ";
var_dump($a);

===32bit===                                                     ===64bit===
int:max = 2147483647                                          | int:max = 9223372036854775807
        = int(2147483647)                                     |         = int(9223372036854775807)
     +1 = 2147483648                                          |      +1 = 9.2233720368548E+18
        = float(2147483648)                                   |         = float(9.2233720368548E+18)
int:min = -2147483648                                         | int:min = -9223372036854775808
        = int(-2147483648)                                    |         = int(-9223372036854775808)
     -1 = -2147483649                                         |      -1 = -9.2233720368548E+18
        = float(-2147483649)                                  |         = float(-9.2233720368548E+18)

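The maximum and minimum are computed here rather than taken from constants (PHP does define PHP_INT_MAX, and PHP_INT_MIN as of PHP 7.0).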

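What fooled me at first was that in the 32-bit environment, adding 1 to the maximum or subtracting 1 from the minimum appeared to produce a value that still fit in the int type. var_dump() shows that the value has actually become a float.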
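Why the 32-bit result looks like an integer while the 64-bit one does not can be reproduced outside PHP. The sketch below assumes that the PHP build used above prints floats with 14 significant digits (PHP's default precision setting) and imitates that with Python's %G format; php_like is a hypothetical helper name, not a PHP API.

```python
def php_like(value: float) -> str:
    """Format a float with 14 significant digits, dropping trailing zeros."""
    return "%.14G" % value


# Ten digits fit within 14 significant digits, so no exponent is shown.
print(php_like(2147483647 + 1.0))
print(php_like(-2147483648 - 1.0))

# Nineteen digits do not fit, so the exponent form gives the float away.
print(php_like(9223372036854775807 + 1.0))
print(php_like(-9223372036854775808 - 1.0))
```

The first two lines print as plain digit strings, which is why echo alone could not reveal the type change in the 32-bit environment.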

Python 2

a = 1
ct = 0
while ct < 128 and (a << 1) + 1 > a:
    a <<= 1
    a += 1
    ct += 1
print "long:max? = ",a
print "          = ",type(a)
a += 1
print "       +1 = ",a
print "          = ",type(a)

a = -1
ct = 0
while ct < 128 and (a << 1) < a:
    a <<= 1
    ct += 1
print "long:min? = ",a
print "          = ",type(a)
a -= 1
print "       -1 = ",a
print "          = ",type(a)

===32bit===                                                     ===64bit===
long:max? =  680564733841876926926749214863536422911            long:max? =  680564733841876926926749214863536422911
          =  <type 'long'>                                                =  <type 'long'>
       +1 =  680564733841876926926749214863536422912                   +1 =  680564733841876926926749214863536422912
          =  <type 'long'>                                                =  <type 'long'>
long:min? =  -340282366920938463463374607431768211456           long:min? =  -340282366920938463463374607431768211456
          =  <type 'long'>                                                =  <type 'long'>
       -1 =  -340282366920938463463374607431768211457                  -1 =  -340282366920938463463374607431768211457
          =  <type 'long'>                                                =  <type 'long'>

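I already knew that Python integers have no maximum or minimum, so the loop is cut off after 128 iterations. Adding 1 to or subtracting 1 from the computed values shows that there is still room to grow.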

Python 3

a = 1
ct = 0
while ct < 128 and (a << 1) + 1 > a:
    a <<= 1
    a += 1
    ct += 1
print("int:max? = ",a)
print("         = ",type(a))
a += 1
print("      +1 = ",a)
print("         = ",type(a))

a = -1
ct = 0
while ct < 128 and (a << 1) < a:
    a <<= 1
    ct += 1
print("int:min? = ",a)
print("         = ",type(a))
a -= 1
print("      -1 = ",a)
print("         = ",type(a))

===32bit===                                                     ===64bit===
int:max? =  680564733841876926926749214863536422911             int:max? =  680564733841876926926749214863536422911
         =  <class 'int'>                                                =  <class 'int'>
      +1 =  680564733841876926926749214863536422912                   +1 =  680564733841876926926749214863536422912
         =  <class 'int'>                                                =  <class 'int'>
int:min? =  -340282366920938463463374607431768211456            int:min? =  -340282366920938463463374607431768211456
         =  <class 'int'>                                                =  <class 'int'>
      -1 =  -340282366920938463463374607431768211457                  -1 =  -340282366920938463463374607431768211457
         =  <class 'int'>                                                =  <class 'int'>

The result is the same as for Python 2, except that the type is reported as int rather than long.
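The large numbers in the output are just what the 128-iteration cutoff produces, not a limit of the language. A small Python 3 sketch makes this explicit by wrapping the same loops in functions (probe_max and probe_min are names chosen here for illustration) and comparing against closed forms.

```python
def probe_max(limit: int = 128) -> int:
    """Same loop as above: append a 1 bit until overflow or the cutoff."""
    a, ct = 1, 0
    while ct < limit and (a << 1) + 1 > a:
        a = (a << 1) + 1
        ct += 1
    return a


def probe_min(limit: int = 128) -> int:
    """Same loop as above: shift -1 left until overflow or the cutoff."""
    a, ct = -1, 0
    while ct < limit and (a << 1) < a:
        a <<= 1
        ct += 1
    return a


# Starting from one bit and adding 128 more gives 129 one-bits.
print(probe_max() == 2**129 - 1)
print(probe_min() == -(2**128))
# Raising the cutoff simply yields a larger value.
print(probe_max(256).bit_length())
```

Since the loop condition never fails, the iteration counter is the only thing that stops it.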

Ruby

a = 1
ct = 0
while ct < 128 && (a << 1) + 1 > a
    a <<= 1
    a += 1
    ct += 1
end
print "int:max? = ",a,"\n"
a += 1
print "      +1 = ",a,"\n"

a = -1
ct = 0
while ct < 128 && (a << 1) < a
    a <<= 1
    ct += 1
end
print "int:min? = ",a,"\n"
a -= 1
print "      -1 = ",a,"\n"

===32bit===                                                     ===64bit===
int:max? = 680564733841876926926749214863536422911              int:max? = 680564733841876926926749214863536422911
      +1 = 680564733841876926926749214863536422912                    +1 = 680564733841876926926749214863536422912
int:min? = -340282366920938463463374607431768211456             int:min? = -340282366920938463463374607431768211456
      -1 = -340282366920938463463374607431768211457                   -1 = -340282366920938463463374607431768211457

I knew that Ruby also has no maximum or minimum, so as with Python the loop is cut off after 128 iterations.

Perl

# Reference: http://d.hatena.ne.jp/sardine/20131026
my $a = ~0;
print "int:max = ",$a,"\n";
++$a;
print "     +1 = ",$a,"\n";

my $a = -(~0 >> 1) - 1;
print "int:min = ",$a,"\n";
--$a;
print "     -1 = ",$a,"\n";

===32bit===                                                     ===64bit===
int:max = 4294967295                                          | int:max = 18446744073709551615
     +1 = 4294967296                                          |      +1 = 1.84467440737096e+19
int:min = -2147483648                                         | int:min = -9223372036854775808
     -1 = -2147483649                                         |      -1 = -9.22337203685478e+18

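At first I was stuck because bit operations did not give the results I expected. Without the information from the reference site noted in the source, I would probably have ended up with strange results.

The 32-bit results look a little odd: the values appear to go past the maximum and minimum when 1 is added or subtracted. I wonder what is going on here...

(2017/12/08) Update: I figured out what is happening.

Following a Stack Overflow question (perl - check if a number is int or float - Stack Overflow), I changed the script to dump the internal state of the variable.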

use Devel::Peek;

# Reference: http://d.hatena.ne.jp/sardine/20131026
# Reference: https://stackoverflow.com/questions/4094036/check-if-a-number-is-int-or-float
my $a = ~0;
print "int:max = ",$a,"\n";
Dump($a);
print STDERR "\n";
++$a;
print "     +1 = ",$a,"\n";
Dump($a);
print STDERR "\n";

my $a = -(~0 >> 1) - 1;
print "int:min = ",$a,"\n";
Dump($a);
print STDERR "\n";
--$a;
print "     -1 = ",$a,"\n";
Dump($a);

This produces the following output.

===32bit===                                                     ===64bit===
int:max = 4294967295                                          | int:max = 18446744073709551615
SV = IV(0x8139d84) at 0x8139d88                               | SV = IV(0x106a778) at 0x106a788
  REFCNT = 1                                                      REFCNT = 1
  FLAGS = (PADMY,IOK,pIOK,IsUV)                                   FLAGS = (PADMY,IOK,pIOK,IsUV)
  UV = 4294967295                                             |   UV = 18446744073709551615

     +1 = 4294967296                                          |      +1 = 1.84467440737096e+19
SV = PVNV(0x811e9e0) at 0x8139d88                             | SV = PVNV(0x104cfe0) at 0x106a788
  REFCNT = 1                                                      REFCNT = 1
  FLAGS = (PADMY,NOK,POK,pNOK,pPOK)                               FLAGS = (PADMY,NOK,POK,pNOK,pPOK)
  IV = 0                                                          IV = 0
  NV = 4294967296                                             |   NV = 1.84467440737096e+19
  PV = 0x81417a0 "4294967296"\0                               |   PV = 0x106d530 "1.84467440737096e+19"\0
  CUR = 10                                                    |   CUR = 20
  LEN = 36                                                    |   LEN = 40

int:min = -2147483648                                         | int:min = -9223372036854775808
SV = IV(0x8139e14) at 0x8139e18                               | SV = IV(0x106a940) at 0x106a950
  REFCNT = 1                                                      REFCNT = 1
  FLAGS = (PADMY,IOK,pIOK)                                        FLAGS = (PADMY,IOK,pIOK)
  IV = -2147483648                                            |   IV = -9223372036854775808

     -1 = -2147483649                                         |      -1 = -9.22337203685478e+18
SV = PVNV(0x811e9f4) at 0x8139e18                             | SV = PVNV(0x104d000) at 0x106a950
  REFCNT = 1                                                      REFCNT = 1
  FLAGS = (PADMY,NOK,POK,pNOK,pPOK)                               FLAGS = (PADMY,NOK,POK,pNOK,pPOK)
  IV = -2147483648                                            |   IV = -9223372036854775808
  NV = -2147483649                                            |   NV = -9.22337203685478e+18
  PV = 0x8145810 "-2147483649"\0                              |   PV = 0x106d820 "-9.22337203685478e+18"\0
  CUR = 11                                                    |   CUR = 21
  LEN = 36                                                    |   LEN = 40

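This shows that adding 1 to the maximum or subtracting 1 from the minimum converts the value to a floating-point number automatically: the flags change from IOK to NOK and the result is stored in NV. Mystery solved.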
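One consequence of this conversion is that the 32-bit results are still exact while the 64-bit ones are not, because a double carries only 53 bits of significand. The following Python sketch checks which of the four values from the dump survive a round trip through a double. It assumes Perl's NV is an IEEE 754 double and that Perl prints it with about 15 significant digits, which is what the output above suggests; exact_as_double is a hypothetical helper name.

```python
def exact_as_double(n: int) -> bool:
    """True if n is unchanged after a round trip through a 64-bit float."""
    return int(float(n)) == n


# 32-bit max+1, 32-bit min-1, 64-bit max+1, 64-bit min-1
for n in (2**32, -2**31 - 1, 2**64, -2**63 - 1):
    print(n, exact_as_double(n), "%.15g" % float(n))
```

The value 2**64 is exact only because it is a power of two. -2**63 - 1 is rounded to -2**63, so in the 64-bit case the subtraction is effectively lost.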

Go

package main

import (
    "fmt"
)

func main() {
    var a = 0
    var ct = 0

    a = 1
    ct = 0
    for ct < 128 && (a << 1) + 1 > a {
        a <<= 1
        a += 1
        ct += 1
    }
    fmt.Printf("int:max = %d\n", a)
    a += 1
    fmt.Printf("     +1 = %d\n", a)

    a = -1
    ct = 0
    for ct < 128 && (a << 1) < a {
        a <<= 1
        ct += 1
    }
    fmt.Printf("int:min = %d\n", a)
    a -= 1
    fmt.Printf("     -1 = %d\n", a)
}

===32bit===                                                     ===64bit===
int:max = 2147483647                                          | int:max = 9223372036854775807
     +1 = -2147483648                                         |      +1 = -9223372036854775808
int:min = -2147483648                                         | int:min = -9223372036854775808
     -1 = 2147483647                                          |      -1 = 9223372036854775807

I was a little surprised that Go's int is affected by whether the environment is 32-bit or 64-bit.
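The wraparound seen in the Go output is ordinary two's-complement arithmetic at the width of int. As an illustration only, here is a Python sketch that reduces an unbounded integer to a signed value of a given width (wrap_signed is a name invented for this example) and reproduces both columns.

```python
def wrap_signed(value: int, bits: int) -> int:
    """Reduce value to a signed two's-complement integer of the given width."""
    value &= (1 << bits) - 1      # keep only the low `bits` bits
    if value >> (bits - 1):       # sign bit set: interpret as negative
        value -= 1 << bits
    return value


for bits in (32, 64):
    hi = (1 << (bits - 1)) - 1
    lo = -(1 << (bits - 1))
    print(f"{bits}bit max+1 = {wrap_signed(hi + 1, bits)}")
    print(f"{bits}bit min-1 = {wrap_signed(lo - 1, bits)}")
```

With bits set to 32 this gives the left column of the Go output, and with 64 the right column.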

bash

#! /bin/bash

a=1
while [ $(((a << 1) + 1)) -gt ${a} ]; do
    a=$(((a << 1) + 1))
done
echo "int:max = ${a}"
a=$((a + 1))
echo "     +1 = ${a}"

a=-1
while [ $((a << 1)) -lt ${a} ]; do
    a=$((a << 1))
done
echo "int:min = ${a}"
a=$((a - 1))
echo "     -1 = ${a}"

===32bit===                                                     ===64bit===
int:max = 9223372036854775807                                   int:max = 9223372036854775807
     +1 = -9223372036854775808                                       +1 = -9223372036854775808
int:min = -9223372036854775808                                  int:min = -9223372036854775808
     -1 = 9223372036854775807                                        -1 = 9223372036854775807

Conversely, I had expected shell scripts to be affected by 32-bit versus 64-bit, so this result was also a surprise.

Summary

Here are the results as a table.

Language	Type	32bit max	32bit min	64bit max	64bit min
Java	byte	127	-128	127	-128
	short	32767	-32768	32767	-32768
	int	2147483647	-2147483648	2147483647	-2147483648
	long	9223372036854775807	-9223372036854775808	9223372036854775807	-9223372036854775808
C	short	32767	-32768	32767	-32768
	unsigned short	65535	0	65535	0
	int	2147483647	-2147483648	2147483647	-2147483648
	unsigned int	4294967295	0	4294967295	0
	long	2147483647	-2147483648	9223372036854775807	-9223372036854775808
	unsigned long	4294967295	0	18446744073709551615	0
	long long	9223372036854775807	-9223372036854775808	9223372036854775807	-9223372036854775808
	unsigned long long	18446744073709551615	0	18446744073709551615	0
C++	short	32767	-32768	32767	-32768
	unsigned short	65535	0	65535	0
	int	2147483647	-2147483648	2147483647	-2147483648
	unsigned int	4294967295	0	4294967295	0
	long	2147483647	-2147483648	9223372036854775807	-9223372036854775808
	unsigned long	4294967295	0	18446744073709551615	0
	long long	9223372036854775807	-9223372036854775808	9223372036854775807	-9223372036854775808
	unsigned long long	18446744073709551615	0	18446744073709551615	0
PHP	int	2147483647	-2147483648	9223372036854775807	-9223372036854775808
Python 2	long	∞	-∞	∞	-∞
Python 3	int	∞	-∞	∞	-∞
Ruby	int	∞	-∞	∞	-∞
Perl	int	4294967295	-2147483648	18446744073709551615	-9223372036854775808
Go	int	2147483647	-2147483648	9223372036854775807	-9223372036854775808
bash	int	9223372036854775807	-9223372036854775808	9223372036854775807	-9223372036854775808