2017-12-12

Trying the regular expressions "^", "$", "\A", and "\z" in various languages

Java C C++ PHP Python Ruby Perl Go

Hiroshi Tokumaru's diary (徳丸浩の日記): For validation with regular expressions, use \A and \z instead of ^ and $ https://t.co/Lc20UYnwMT
- HHeLiBeX (@hhelibex), December 11, 2017

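So, while hoping I don't get corrected from all directions, here are my notes from writing a test program in each language I have on hand.

The input is the following string.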

abc
123
*+=

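Up to four patterns are tried. If every pattern matches, the output is

1234

Conversely, a pattern that does not match prints "0", and a matching mode that does not exist in the language prints "-".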

Environment

Since I am using what I have on hand, the environment is limited to the following.

CentOS 7
- Java (openjdk version "1.8.0_151")
- C (gcc (GCC) 4.8.5)
  - compiled with -std=gnu11
- C++ (g++ (GCC) 4.8.5)
  - compiled with -std=gnu++1y
- PHP (PHP 5.4.16 (cli))
- Python 2 (Python 2.7.5)
- Python 3 (Python 3.6.3)
  - built from source
- Ruby (ruby 2.0.0p648)
- Perl (v5.16.3)
- Go (go version go1.8.3 linux/amd64)

Java

import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.io.Reader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Main {
    private static boolean matches(String str, String pattern, int flags) {
        Pattern p = Pattern.compile(pattern, flags);
        Matcher m = p.matcher(str);
        return m.matches();
    }

    public static void main(String[] args) {
        try (Reader in = new InputStreamReader(System.in);
            PrintWriter out = new PrintWriter(System.out)
        ) {
            char[] buf = new char[1024];
            int len = in.read(buf);
            String s = new String(buf, 0, len);
//          s = s.trim();

            if (matches(s, "^[0-9]+$", 0)) {
                out.print("1");
            } else {
                out.print("0");
            }

            if (matches(s, "^[0-9]+$", Pattern.MULTILINE)) {
                out.print("2");
            } else {
                out.print("0");
            }

            if (matches(s, "\\A[0-9]+\\z", 0)) {
                out.print("3");
            } else {
                out.print("0");
            }

            if (matches(s, "\\A[0-9]+\\z", Pattern.MULTILINE)) {
                out.print("4");
            } else {
                out.print("0");
            }

            out.println();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Output:

0000

None of the patterns match. Matcher.matches() only succeeds when the entire input matches the pattern.
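
For comparison, Python draws the same line between the two kinds of call: re.fullmatch must cover the whole string (like Java's Matcher.matches()), while re.search looks for a match anywhere (like Matcher.find()). The following is only a minimal sketch and not part of the original test; the variable names are made up for illustration.

```python
import re

text = "abc\n123\n*+=\n"  # same three-line input as in the tests

# fullmatch: the pattern must cover the entire string, so this fails
whole = re.fullmatch(r"[0-9]+", text)

# search with MULTILINE: "^" and "$" may anchor at any line, so "123" is found
part = re.search(r"^[0-9]+$", text, re.MULTILINE)

print(whole is None)   # True
print(part.group(0))   # 123
```

So the all-zero Java row reflects the choice of matches(), not a difference in how Java treats the anchors.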

C

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <regex.h>

int matches(const char* str, const char* pattern, int flags) {
    regex_t rb;
    if (regcomp(&rb, pattern, flags)) {
        perror(pattern);
        exit(1);
    }

    regmatch_t rm;
    int res;
    if (!regexec(&rb, str, 1, &rm, 0)) {
        res = 1;
    } else {
        res = 0;
    }

    regfree(&rb);

    return res;
}

int main(int argc, char** argv) {
    char str[1024];
    memset(str, '\0', sizeof(str));

    // Read up to sizeof(str) - 1 bytes so the buffer stays NUL-terminated
    fread(str, sizeof(char), sizeof(str) - 1, stdin);
//  while (str[strlen(str) - 1] == '\n' || str[strlen(str) - 1] == '\r') {
//      str[strlen(str) - 1] = '\0';
//  }

    if (matches(str, "^[0-9]+$", REG_EXTENDED)) {
        printf("1");
    } else {
        printf("0");
    }

    // Treated as having no multi-line mode (REG_NEWLINE is not tried here)
    printf("-");

    // There are no anchors for the start/end of the string
    printf("-");

    // Treated as having no multi-line mode (REG_NEWLINE is not tried here)
    printf("-");

    printf("\n");

    return 0;
}

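Output:

0---

Well, C can't be helped, since only one pattern applies. (Strictly speaking, POSIX regcomp() has a REG_NEWLINE flag that makes "^" and "$" match at line boundaries, but it is not tested here.)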

C++

#include <iostream>
#include <locale>
#include <string>
#include <boost/regex.hpp>

using namespace std;

bool matches(string str, const char* pattern, boost::match_flag_type flags) {
    boost::regex re(pattern);

    boost::smatch sm;
    return boost::regex_search(str, sm, re, flags);
}

int main(int argc, char** argv) {
    istreambuf_iterator<char> it(cin);
    istreambuf_iterator<char> last;
    string str(it, last);
//  str.erase(str.find_last_not_of("\r\n") + 1);

    if (matches(str, "^[0-9]+$", boost::regex_constants::match_single_line)) {
        cout << "1";
    } else {
        cout << "0";
    }

    if (matches(str, "^[0-9]+$", boost::regex_constants::match_default)) {
        cout << "2";
    } else {
        cout << "0";
    }

    if (matches(str, "\\A[0-9]+\\z", boost::regex_constants::match_single_line)) {
        cout << "3";
    } else {
        cout << "0";
    }

    if (matches(str, "\\A[0-9]+\\z", boost::regex_constants::match_default)) {
        cout << "4";
    } else {
        cout << "0";
    }

    cout << endl;

    return EXIT_SUCCESS;
}

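Output:

0200

So this is the behavior I had heard about: in multi-line mode, "^" and "$" match a substring.

As you can tell from the first pattern explicitly passing boost::regex_constants::match_single_line as a flag, multi-line mode appears to be the default in C++ (Boost).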

PHP

<?php

$s = file_get_contents('php://stdin');
//$s = trim($s);

if (preg_match("/^[0-9]+$/", $s)) {
    echo '1';
} else {
    echo '0';
}

if (preg_match("/^[0-9]+$/m", $s)) {
    echo '2';
} else {
    echo '0';
}

if (preg_match("/\A[0-9]+\z/", $s)) {
    echo '3';
} else {
    echo '0';
}

if (preg_match("/\A[0-9]+\z/m", $s)) {
    echo '4';
} else {
    echo '0';
}

echo PHP_EOL;

Output:

0200

Likewise, in multi-line mode "^" and "$" match a substring.

Python 2 / 3

import sys
import re

s = sys.stdin.read()
#s = s.strip()

if re.search(r'^[0-9]+$', s):
    sys.stdout.write('1')
else:
    sys.stdout.write('0')

if re.search(r'^[0-9]+$', s, re.MULTILINE):
    sys.stdout.write('2')
else:
    sys.stdout.write('0')

if re.search(r'\A[0-9]+\Z', s):
    sys.stdout.write('3')
else:
    sys.stdout.write('0')

if re.search(r'\A[0-9]+\Z', s, re.MULTILINE):
    sys.stdout.write('4')
else:
    sys.stdout.write('0')

sys.stdout.write("\n")

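Output:

0200

Likewise, in multi-line mode "^" and "$" match a substring.

Note that Python's re module spells the end-of-string anchor "\Z" rather than "\z"; its "\Z" matches only at the very end of the string.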
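
One case the three-line test input does not exercise is a single trailing newline. As a small sketch (the variable names are illustrative, not from the original program), even without re.MULTILINE Python's "$" matches just before a newline at the end of the string, whereas "\Z" does not:

```python
import re

value = "123\n"  # digits followed by a trailing newline

# "$" also matches just before a newline at the very end of the string
dollar = re.search(r"^[0-9]+$", value)

# "\Z" matches only at the very end, so the trailing newline is rejected
strict = re.search(r"\A[0-9]+\Z", value)

print(dollar is not None)  # True
print(strict is None)      # True
```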

Ruby

s = STDIN.read
#s.chomp!

# There is no single-line mode.
print "-"

if s.match(/^[0-9]+$/)
    print "2"
else
    print "0"
end

# There is no single-line mode.
print "-"

if s.match(/\A[0-9]+\z/)
    print "4"
else
    print "0"
end

print "\n"

Output:

-2-0

As far as I could find, Ruby only has a multi-line mode, so only the multi-line results are printed.

Sure enough, "^" and "$" match a substring.

Perl

my $s;
{
    local $/ = undef;
    $s = <STDIN>;
}
#chomp($s);

if ($s =~ /^[0-9]+$/) {
    print '1';
} else {
    print '0';
}

if ($s =~ /^[0-9]+$/m) {
    print '2';
} else {
    print '0';
}

if ($s =~ /\A[0-9]+\z/) {
    print '3';
} else {
    print '0';
}

if ($s =~ /\A[0-9]+\z/m) {
    print '4';
} else {
    print '0';
}

print "\n";

Output:

0200

As in PHP, adding the m flag enables multi-line mode, and "^" and "$" then match a substring.

Go

package main

import (
    "fmt"
    "os"
    "regexp"
    "bufio"
    "io/ioutil"
//  "strings"
)

func main() {
    stdin := bufio.NewReader(os.Stdin)
    b, _ := ioutil.ReadAll(stdin)
    s := string(b)
//  s = strings.Trim(s, "\r\n")

    {
        m := regexp.MustCompile(`^[0-9]+$`)
        if m.MatchString(s) {
            fmt.Print("1")
        } else {
            fmt.Print("0")
        }
    }

    {
        m := regexp.MustCompile(`(?m)^[0-9]+$`)
        if m.MatchString(s) {
            fmt.Print("2")
        } else {
            fmt.Print("0")
        }
    }

    {
        m := regexp.MustCompile(`\A[0-9]+\z`)
        if m.MatchString(s) {
            fmt.Print("3")
        } else {
            fmt.Print("0")
        }
    }

    {
        m := regexp.MustCompile(`(?m)\A[0-9]+\z`)
        if m.MatchString(s) {
            fmt.Print("4")
        } else {
            fmt.Print("0")
        }
    }

    fmt.Println()
}

Output:

0200

As in PHP, enabling multi-line mode (here with the inline (?m) flag) makes "^" and "$" match a substring.

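Summary

Put into a table, it looks roughly like this. "○" marks a case that matches, "×" a case that does not, and "－" a pattern that does not exist in that language.

(1) Single-line mode, pattern using "^" and "$"
(2) Multi-line mode, pattern using "^" and "$"
(3) Single-line mode, pattern using "\A" and "\z"
(4) Multi-line mode, pattern using "\A" and "\z"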

	(1)	(2)	(3)	(4)
Java	×	×	×	×
C	×	－	－	－
C++	×	○	×	×
PHP	×	○	×	×
Python 2 / 3	×	○	×	×
Ruby	－	○	－	×
Perl	×	○	×	×
Go	×	○	×	×

I also tend to reach for "^" and "$" without thinking, so I will keep this in mind.
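
To make the validation risk concrete, here is a hedged Python sketch of two validators. The function names and the payload are hypothetical and only illustrate column (2) versus column (3) of the table.

```python
import re

DIGITS_STRICT = re.compile(r"\A[0-9]+\Z")
DIGITS_LOOSE = re.compile(r"^[0-9]+$", re.MULTILINE)

def is_number_strict(value):
    """Accept only strings made up entirely of ASCII digits."""
    return DIGITS_STRICT.search(value) is not None

def is_number_loose(value):
    """Accept any string in which at least one whole line is digits."""
    return DIGITS_LOOSE.search(value) is not None

payload = "123\n<script>alert(1)</script>"
print(is_number_loose(payload))   # True: the first line alone satisfies the pattern
print(is_number_strict(payload))  # False
print(is_number_strict("123"))    # True
```

The loose version is what a multi-line "^...$" check amounts to: it validates one line of the input, not the input.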

2017-12-11

Extracting a substring in various languages

Java C C++ PHP Python Ruby Perl Go bash Awk

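Notes from writing, in each language, a program that extracts a substring of the input string.

The requirements are as follows.

A single line of text is given on standard input
- The character encoding is UTF-8
- It may contain characters outside the BMP (which become surrogate pairs in UTF-16)
- It is guaranteed to be at least 3 characters long
Extract the substring "[2, 4)" of the input string (that is, the string made up of the 2nd and 3rd characters)
Write the extracted string to standard output

Environment

Since I am using what I have on hand, the environment is limited to the following.

CentOS 7
- Java (openjdk version "1.8.0_151")
- C (gcc (GCC) 4.8.5)
  - compiled with -std=gnu11
- C++ (g++ (GCC) 4.8.5)
  - compiled with -std=gnu++1y
- PHP (PHP 5.4.16 (cli))
- Python 2 (Python 2.7.5)
- Python 3 (Python 3.6.3)
  - built from source
- Ruby (ruby 2.0.0p648)
- Perl (v5.16.3)
- Go (go version go1.8.3 linux/amd64)
- bash (4.2.46(1)-release)
- Awk (GNU Awk 4.0.2)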

Java

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;

public class Main {
    /**
     * substring that takes surrogate pairs into account
     */
    private static String substring(String s, int startIndex, int endIndex) {
        StringBuilder sb = new StringBuilder();

        if (startIndex < 0) {
            throw new StringIndexOutOfBoundsException(startIndex);
        }
        int cpCount = s.codePointCount(0, s.length());
        if (cpCount < endIndex) {
            throw new StringIndexOutOfBoundsException(endIndex);
        }
        int subLen = endIndex - startIndex;
        if (subLen < 0) {
            throw new StringIndexOutOfBoundsException(subLen);
        }

        int idx = 0;
        for (int i = 0; i < s.length() && idx < endIndex; ++i) {
            char ch1 = s.charAt(i);
            if (startIndex <= idx && idx < endIndex) {
                sb.append(ch1);
            }
            if (Character.isSurrogate(ch1)) {
                char ch2 = s.charAt(++i);
                if (startIndex <= idx && idx < endIndex) {
                    sb.append(ch2);
                }
            }
            ++idx;
        }

        return sb.toString();
    }

    public static void main(String[] args) {
        try (BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
            PrintWriter out = new PrintWriter(System.out)
        ) {
            String s = in.readLine();

            out.println(substring(s, 1, 3));
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

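Java represents a character outside the BMP as a surrogate pair of two char values, so taking that into account made the substring extraction more work here than in any other language.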
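
The gap between "characters" and UTF-16 code units can be seen from Python 3, whose str type indexes by code point. This is only an illustrative sketch; the sample string is an assumption, not the original test input.

```python
# U+1F600 lies outside the BMP, so UTF-16 needs a surrogate pair for it
s = "a\U0001F600b"

code_points = len(s)                           # Python 3 str indexes by code point
utf16_units = len(s.encode("utf-16-le")) // 2  # what Java's String.length() counts

print(code_points)  # 3
print(utf16_units)  # 4
print(s[1:3] == "\U0001F600b")  # True: slicing never splits the pair
```

The Java helper above has to bridge exactly this difference by hand, stepping over two char values whenever it meets a surrogate.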

C

#include <stdio.h>
#include <string.h>
#include <locale.h>
#include <wchar.h>
#include <stdlib.h>

int main(int argc, char** argv) {
    setlocale(LC_ALL, "ja_JP.UTF-8");

    char str[1024];

    fgets(str, sizeof(str), stdin);
    while (str[strlen(str) - 1] == '\n' || str[strlen(str) - 1] == '\r') {
        str[strlen(str) - 1] = '\0';
    }

    wchar_t buf[1024];
    const char* p = str;
    // The third argument is a count of wide characters, not bytes
    mbsrtowcs(buf, &p, sizeof(buf) / sizeof(buf[0]), NULL);

    wchar_t wstr[3];
    memset(wstr, 0, sizeof(wstr));
    // Needless to say, the "2" is a length, not an index
    wcsncpy(wstr, &buf[1], 2);
    fwprintf(stdout, L"%ls\n", wstr);

    return 0;
}

C++

#include <iostream>
#include <locale>
#include <string>
#include <boost/regex.hpp>

using namespace std;

int main(int argc, char** argv) {
    setlocale(LC_ALL, "ja_JP.UTF-8");
    wcout.imbue(locale("japanese"));

    wstring str;
    getline(wcin, str);

    // Note that the "2" is a length, not an index
    str = str.substr(1, 2);
    wcout << str << endl;

    return EXIT_SUCCESS;
}

PHP

<?php

$str = file_get_contents('php://stdin');
$str = preg_replace("/[\r\n]/", '', $str);

// Note that the "2" is a length, not an index
echo mb_substr($str, 1, 2, 'UTF-8') . PHP_EOL;

Python 2

import sys

s = sys.stdin.readline()
ustr = unicode(s, 'UTF-8')
ustr = ustr.replace('\n', '')
ustr = ustr.replace('\r', '')

print ustr[1:3].encode('UTF-8')

Python 3

import sys

b = sys.stdin.buffer.readline()
s = str(b, 'UTF-8')
s = s.replace('\n', '')
s = s.replace('\r', '')

print(s[1:3])

Ruby

str = STDIN.gets
str.chomp!()

# Note that the "2" is a length, not an index
print str[1, 2],"\n"

Perl

use Encode;

my $str = readline(STDIN);
chomp($str);

my $ustr = decode('UTF-8', $str);
# Note that the "2" is a length, not an index
print encode('UTF-8', substr($ustr, 1, 2)),"\n";

Go

package main

import (
    "fmt"
    "os"
    "io"
    "bufio"
)

func ReadLine(reader *bufio.Reader) (s string, err error) {
    prefix := false
    buf := make([]byte, 0)
    var line []byte
    for {
        line, prefix, err = reader.ReadLine()
        if err == io.EOF {
            return
        }
        buf = append(buf, line...)
        if prefix {
            continue
        }
        s = string(buf)
        return
    }
}

func main() {
    stdin := bufio.NewReader(os.Stdin)
    s, _ := ReadLine(stdin)

    runes := []rune(s)
    fmt.Println(string(runes[1:3]))
}

bash

#! /bin/bash

IFS= read -r s

echo "${s}" | sed -e 's/^.\(..\).*$/\1/g'

Awk

{
    gsub(/[\r\n]/, "");
    # Note that the third parameter "2" is a length, not an index
    print substr($0, 2, 2);
}

2017-12-10

Listing the files in a given directory in various languages

Java C C++ PHP Python Ruby Perl Go bash

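Notes from writing, in each language, a program that lists the files directly under a given directory.

The requirements are as follows.

One directory name is given as a command-line argument
Read the names of the files directly under the given directory
- There are at most 256 files
- Only directories and regular files exist
- Directories are excluded
- So-called hidden files (files whose names start with ".") are included
Sort the file names in dictionary order
Write the list of file names to "result/out.txt" under the given directory
- The "result" directory is prepared in advance, so no existence check is needed

Environment

Since I am using what I have on hand, the environment is limited to the following.

CentOS 7
- Java (openjdk version "1.8.0_151")
- C (gcc (GCC) 4.8.5)
  - compiled with -std=gnu11
- C++ (g++ (GCC) 4.8.5)
  - compiled with -std=gnu++1y
- PHP (PHP 5.4.16 (cli))
- Python 2 (Python 2.7.5)
- Python 3 (Python 3.6.3)
  - built from source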
- Ruby (ruby 2.0.0p648)
- Perl (v5.16.3)
- Go (go version go1.8.3 linux/amd64)
- bash (4.2.46(1)-release)

Java

import java.io.File;
import java.io.FileFilter;
import java.io.FileWriter;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class Main {
    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("Usage: java Main dirname");
            System.exit(1);
            return;
        }

        File dir = new File(args[0]);
        if (!dir.isDirectory()) {
            System.err.println(dir + ": No such directory");
            System.exit(1);
            return;
        }

        // Read the list of files from the directory
        File[] files = dir.listFiles(new FileFilter() {
            public boolean accept(File file) {
                return file.isFile();
            }
        });

        // Sort the file names in dictionary order
        List<String> filenames = new ArrayList<>();
        for (File file : files) {
            filenames.add(file.getName());
        }
        Collections.sort(filenames);

        // Write the list of file names
        try (PrintWriter out = new PrintWriter(new FileWriter(new File(dir + "/result/out.txt")))) {
            for (String filename : filenames) {
                out.println(filename);
            }
            out.flush();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

C

#include <stdio.h>
#include <stdlib.h>
#include <dirent.h>
#include <sys/stat.h>
#include <string.h>
#include <limits.h>

int isdir(const char* path) {
    struct stat st;
    if (stat(path, &st)) {
        return 0;
    }
    return ((st.st_mode & S_IFMT) == S_IFDIR);
}

int isfile(const char* path) {
    struct stat st;
    if (stat(path, &st)) {
        return 0;
    }
    return ((st.st_mode & S_IFMT) != S_IFDIR);
}

int cmp(const void* p1, const void* p2) {
    const char* str1 = (const char*)p1;
    const char* str2 = (const char*)p2;
    return strcmp(str1, str2);
}

int main(int argc, char** argv) {
    if (argc != 2) {
        fprintf(stderr, "Usage: %s dirname\n", argv[0]);
        exit(1);
    }

    char dir[PATH_MAX];
    strcpy(dir, argv[1]);
    if (!isdir(dir)) {
        fprintf(stderr, "%s: No such directory\n", dir);
        exit(1);
    }

    // Read the list of files from the directory
    DIR* dp = opendir(dir);
    if (!dp) {
        perror(dir);
        exit(1);
    }
    char filenames[256][PATH_MAX];
    int ct = 0;
    struct dirent* entry;
    while ((entry = readdir(dp))) {
        char tmp[PATH_MAX];
        sprintf(tmp, "%s/%s", dir, entry->d_name);
        if (isfile(tmp)) {
            strcpy(filenames[ct], entry->d_name);
            ++ct;
        }
    }
    closedir(dp);

    // Sort the file names in dictionary order
    qsort(filenames, ct, sizeof(char) * PATH_MAX, cmp);

    // Write the list of file names
    char outFile[PATH_MAX];
    sprintf(outFile, "%s/result/out.txt", dir);
    FILE* outFp = fopen(outFile, "w");
    if (!outFp) {
        perror(outFile);
        exit(1);
    }
    for (int i = 0; i < ct; ++i) {
        fprintf(outFp, "%s\n", filenames[i]);
    }
    fclose(outFp);

    return 0;
}

C++

#include <iostream>
#include <fstream>
#include <vector>
#include <boost/filesystem.hpp>

using namespace std;
using namespace boost::filesystem;

int main(int argc, char** argv) {
    if (argc != 2) {
        cerr << "Usage: " << argv[0] << " dirname" << endl;
        return EXIT_FAILURE;
    }

    path dir(argv[1]);
    if (!exists(dir) || !is_directory(dir)) {
        cerr << dir << ": No such directory" << endl;
        return EXIT_FAILURE;
    }

    // Read the list of files from the directory
    vector<string> filenames;
    try {
        directory_iterator end;
        for (directory_iterator it(dir); it != end; ++it) {
            if (!is_directory(it->path())) {
                filenames.push_back(it->path().filename().string());
            }
        }
    } catch (const filesystem_error& e) {
        cerr << e.what() << endl;
    }

    // Sort the file names in dictionary order
    sort(filenames.begin(), filenames.end());

    // Write the list of file names
    string outFile = dir.string() + "/result/out.txt";
    ofstream outFs(outFile, ios::binary);
    if (!outFs) {
        cerr << outFile << ": Cannot open file" << endl;
        exit(1);
    }
    for (string& filename : filenames) {
        outFs << filename << endl;
    }
    outFs.close();

    return EXIT_SUCCESS;
}

I relied on Boost. Compiling requires "-lboost_filesystem -lboost_system".

PHP

<?php

if (count($argv) != 2) {
    file_put_contents("php://stderr", "Usage: {$argv[0]} dirname" . PHP_EOL);
    exit(1);
}
$dir = $argv[1];
if (!is_dir($dir)) {
    file_put_contents('php://stderr', "{$dir}: No such directory" . PHP_EOL);
    exit(1);
}

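// Read the list of files from the directory
$dp = opendir($dir);
if (!$dp) {
    file_put_contents('php://stderr', "{$dir}: Cannot open directory" . PHP_EOL);
    exit(1);
}
$filenames = array();
// Compare against false explicitly so that a file named "0" does not end the loop
while (($f = readdir($dp)) !== false) {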
    if (is_file("{$dir}/{$f}")) {
        $filenames[] = $f;
    }
}
closedir($dp);

// Sort the file names in dictionary order
sort($filenames);

// Write the list of file names
$outFile = "{$dir}/result/out.txt";
$fp = fopen($outFile, 'w');
if (!$fp) {
    file_put_contents('php://stderr', "{$outFile}: Cannot open output file" . PHP_EOL);
    exit(1);
}
foreach ($filenames as $filename) {
    fprintf($fp, "%s\n", $filename);
}
fclose($fp);

Python 2

# -*- coding: utf-8 -*-
import sys
import os

if len(sys.argv) != 2:
        sys.stderr.write("Usage: " + sys.argv[0] + " dirname\n");
        exit(1)
dirPath = sys.argv[1];
if not os.path.isdir(dirPath):
        sys.stderr.write(dirPath + ": No such directory\n");
        exit(1)

# Read the list of files from the directory
filenames = []
for f in os.listdir(dirPath):
        if os.path.isfile(dirPath + '/' + f):
                filenames.append(f)

# Sort the file names in dictionary order
filenames.sort()

# Write the list of file names (O_TRUNC discards any previous contents)
outFile = dirPath + "/result/out.txt"
outFp = os.open(outFile, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
for f in filenames:
        os.write(outFp, f + "\n");
os.close(outFp)

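Apparently Dir.glob could also be used, but when I tried it I found that it has to be called twice to pick up hidden files, so this time I went with the straightforward approach.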

Python 3

import sys
import os

if len(sys.argv) != 2:
        sys.stderr.write("Usage: " + sys.argv[0] + " dirname\n");
        exit(1)
dirPath = sys.argv[1];
if not os.path.isdir(dirPath):
        sys.stderr.write(dirPath + ": No such directory\n");
        exit(1)

# Read the list of files from the directory
filenames = []
for f in os.listdir(dirPath):
        if os.path.isfile(dirPath + '/' + f):
                filenames.append(f)

# Sort the file names in dictionary order
filenames.sort()

# Write the list of file names (O_TRUNC discards any previous contents)
outFile = dirPath + "/result/out.txt"
outFp = os.open(outFile, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
for f in filenames:
        os.write(outFp, f.encode('utf-8') + b"\n");
os.close(outFp)

Ruby

if ARGV.length != 1
    STDERR.puts('Usage: ' + __FILE__ + ' dirname')
    exit 1
end

dir = ARGV[0]
if !File.directory?(dir)
    STDERR.puts(dir + ': No such directory')
    exit 1
end

# Read the list of files from the directory
filenames = []
Dir.foreach(dir).each do |filename|
    if File.file?(dir + '/' + filename)
        filenames.push(File.basename(filename))
    end
end

# Sort the file names in dictionary order
filenames.sort!

# Write the list of file names
outFile = dir + '/result/out.txt'
outFp = File.open(outFile, mode = 'wb')
for filename in filenames
    outFp.puts(filename + "\n")
end
outFp.close()

Perl

if (@ARGV != 1) {
    print(STDERR 'Usage: ' . __FILE__ . " dirname\n");
    exit(1);
}
my $dir = $ARGV[0];
if (! -d $dir) {
    print(STDERR $dir . ": No such directory\n");
    exit(1);
}

# Read the list of files from the directory
my $dp;
my $res = opendir($dp, $dir);
if (!$res) {
    print(STDERR $dir . ':' . $! . "\n");
    exit(1);
}
my @filenames = ();
my $ct = 0;
while (my $filename = readdir($dp)) {
    if (-f $dir . '/' . $filename) {
        $filenames[$ct++] = $filename;
    }
}
closedir($dp);

# Sort the file names in dictionary order
@filenames = sort(@filenames);

# Write the list of file names
my $outFile = $dir . '/result/out.txt';
my $outFp;
$res = open($outFp, '>', $outFile);
if (!$res) {
    print(STDERR $outFile . ':' . $! . "\n");
    exit(1);
}
for (my $i = 0; $i < @filenames; ++$i) {
    print($outFp $filenames[$i] . "\n");
}
close($outFp);

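Apparently glob() could also be used, but it looked like it would have to be called twice to pick up hidden files, so this time I went with the straightforward approach.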
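
The same trade-off exists in Python, where glob.glob skips names that start with "." unless the pattern itself starts with a dot. The following is a minimal sketch using a throwaway temporary directory; the file names are made up for illustration.

```python
import glob
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    for name in ("visible.txt", ".hidden"):
        open(os.path.join(d, name), "w").close()

    listed = sorted(os.listdir(d))
    globbed = sorted(os.path.basename(p) for p in glob.glob(os.path.join(d, "*")))
    dotted = sorted(os.path.basename(p) for p in glob.glob(os.path.join(d, ".*")))

print(listed)   # ['.hidden', 'visible.txt']
print(globbed)  # ['visible.txt']
print(dotted)   # ['.hidden']
```

A plain directory listing returns both kinds of name in one call, which is why the readdir-style loop is the simpler fit for the requirement that hidden files be included.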

Go

package main

import (
    "os"
    "fmt"
    "io/ioutil"
    "sort"
)

func main() {
    if len(os.Args) != 2 {
        fmt.Fprintln(os.Stderr, "Usage: " + os.Args[0] + " dirname")
        os.Exit(1)
    }
    dir := os.Args[1]
    // Check the error as well; statInfo is nil when the path does not exist
    statInfo, err := os.Stat(dir)
    if err != nil || !statInfo.IsDir() {
        fmt.Fprintln(os.Stderr, dir + ": No such directory")
        os.Exit(1)
    }

    // Read the list of files from the directory
    files, err := ioutil.ReadDir(dir)
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    count := 0
    for _, file := range files {
        if (!file.IsDir()) {
            count += 1
        }
    }
    filenames := make([]string, count)
    i := 0
    for _, file := range files {
        if (!file.IsDir()) {
            filenames[i] = file.Name()
            i += 1
        }
    }

    // Sort the file names in dictionary order
    sort.Strings(filenames)

    // Write the list of file names
    outFile := dir + "/result/out.txt"
    outFp, err := os.Create(outFile)
    if err != nil {
        fmt.Fprintln(os.Stderr, err)
        os.Exit(1)
    }
    defer outFp.Close()
    for _, filename := range filenames {
        fmt.Fprintln(outFp, filename)
    }
}

I could have used "container/list", but sorting it looked like a hassle, so I took the easy way out.

bash

#! /bin/bash

dir=$1
if [ -z "${dir}" ]; then
    echo "Usage: ${0} dirname"
    exit 1
fi
if [ ! -d "${dir}" ]; then
    echo "${dir}: No such directory"
    exit 1
fi

# -p appends "/" to directories only; -F would also append "*" to executables
ls -1ap "${dir}" | grep -v '/$' | LANG=C sort > "${dir}/result/out.txt"

Without "LANG=C", the sort order is not what I expect when the directory contains files with Japanese names.
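
For reference, Python's sorted() compares str values by code point regardless of the locale, which for UTF-8 names gives the same order as a byte-wise sort. A tiny sketch with made-up file names:

```python
names = ["b.txt", "A.txt", "a.txt", "B.txt", ".profile"]

# sorted() compares str values by code point, independent of the locale,
# so "." sorts first, then uppercase, then lowercase
ordered = sorted(names)
print(ordered)
# ['.profile', 'A.txt', 'B.txt', 'a.txt', 'b.txt']
```

A locale-aware sort such as the default sort(1) under a Japanese or English UTF-8 locale may interleave upper and lower case or order the dot differently, which would explain why the shell version needs the explicit setting.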

2017-12-07

Maximum and minimum values of the integer types in various languages

Java C C++ PHP Python Ruby Perl Go bash

Notes from suddenly deciding to summarize the maximum and minimum values of the integer types in each language.

Environment

Since I am using what I have on hand, the environment is limited to the following. The 32-bit environment was set up in a hurry just for this.

CentOS 6 (32-bit)
- Java (openjdk version "1.8.0_151")
- C (gcc (GCC) 4.4.7)
  - compiled with -std=gnu99
- C++ (g++ (GCC) 4.4.7)
  - compiled with -std=gnu++0x
- PHP (PHP 5.3.3 (cli))
- Python 2 (Python 2.6.6)
- Python 3 (Python 3.6.3)
  - built from source
- Ruby (ruby 1.8.7 (2013-06-27 patchlevel 374))
- Perl (v5.10.1)
- Go (go version go1.7.6 linux/386)
- bash (4.1.2(2)-release)
CentOS 7 (64-bit)
- Java (openjdk version "1.8.0_151")
- C (gcc (GCC) 4.8.5)
  - compiled with -std=gnu11
- C++ (g++ (GCC) 4.8.5)
  - compiled with -std=gnu++1y
- PHP (PHP 5.4.16 (cli))
- Python 2 (Python 2.7.5)
- Python 3 (Python 3.6.3)
  - built from source
- Ruby (ruby 2.0.0p648)
- Perl (v5.16.3)
- Go (go version go1.8.3 linux/amd64)
- bash (4.2.46(1)-release)

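Below, for each language, are the source used for the check, the results (an sdiff of the output from the 32-bit and 64-bit environments), and supplementary notes.

In every language the output follows roughly this pattern.

Find and print the maximum value
Add 1 to the maximum and confirm that the value wraps around
Find and print the minimum value
Subtract 1 from the minimum and confirm that the value wraps around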

Java

public class Main {
    public static void main(String[] args) {
        {
            byte a;
            a = Byte.MAX_VALUE;
            System.out.println("byte:max  = " + a);
            ++a;
            System.out.println("      +1  = " + a);

            a = Byte.MIN_VALUE;
            System.out.println("byte:min  = " + a);
            --a;
            System.out.println("      -1  = " + a);
        }
        {
            short a;
            a = Short.MAX_VALUE;
            System.out.println("short:max = " + a);
            ++a;
            System.out.println("       +1 = " + a);

            a = Short.MIN_VALUE;
            System.out.println("short:min = " + a);
            --a;
            System.out.println("       -1 = " + a);
        }
        {
            int a;
            a = Integer.MAX_VALUE;
            System.out.println("int:max   = " + a);
            ++a;
            System.out.println("     +1   = " + a);

            a = Integer.MIN_VALUE;
            System.out.println("int:min   = " + a);
            --a;
            System.out.println("     -1   = " + a);
        }
        {
            long a;
            a = Long.MAX_VALUE;
            System.out.println("long:max  = " + a);
            ++a;
            System.out.println("      +1  = " + a);

            a = Long.MIN_VALUE;
            System.out.println("long:min  = " + a);
            --a;
            System.out.println("      -1  = " + a);
        }
    }
}

===32bit===                                                     ===64bit===
byte:max  = 127                                                 byte:max  = 127
      +1  = -128                                                      +1  = -128
byte:min  = -128                                                byte:min  = -128
      -1  = 127                                                       -1  = 127
short:max = 32767                                               short:max = 32767
       +1 = -32768                                                     +1 = -32768
short:min = -32768                                              short:min = -32768
       -1 = 32767                                                      -1 = 32767
int:max   = 2147483647                                          int:max   = 2147483647
     +1   = -2147483648                                              +1   = -2147483648
int:min   = -2147483648                                         int:min   = -2147483648
     -1   = 2147483647                                               -1   = 2147483647
long:max  = 9223372036854775807                                 long:max  = 9223372036854775807
      +1  = -9223372036854775808                                      +1  = -9223372036854775808
long:min  = -9223372036854775808                                long:min  = -9223372036854775808
      -1  = 9223372036854775807                                       -1  = 9223372036854775807

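As expected of Java, the maximum and minimum values do not change with the environment, since the language specification fixes the width of each primitive type.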
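
The wraparound in the table can be reproduced from Python, whose own int never overflows, by pushing values through fixed-width ctypes integers (ctypes does no overflow checking and simply keeps the low bits). This is a sketch for illustration; the helper names are hypothetical.

```python
import ctypes

def wrap_int32(value):
    """Reduce an arbitrary Python int to a signed 32-bit value."""
    return ctypes.c_int32(value).value

def wrap_int64(value):
    """Reduce an arbitrary Python int to a signed 64-bit value."""
    return ctypes.c_int64(value).value

print(wrap_int32(2**31 - 1 + 1))   # -2147483648
print(wrap_int32(-2**31 - 1))      # 2147483647
print(wrap_int64(2**63 - 1 + 1))   # -9223372036854775808
```

These match the int and long rows of the Java output above, where arithmetic is defined to wrap in two's complement.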

C

#include <stdio.h>
#include <limits.h>

int main(int argc, char** argv) {
    {
        short a;
        a = SHRT_MAX;
        printf("short:max       = %d\n", a);
        ++a;
        printf("             +1 = %d\n", a);

        a = SHRT_MIN;
        printf("short:min       = %d\n", a);
        --a;
        printf("             -1 = %d\n", a);
    }

    {
        unsigned short a;
        a = USHRT_MAX;
        printf("ushort:max      = %u\n", a);
        ++a;
        printf("             +1 = %u\n", a);

        a = 0;
        printf("ushort:min      = %u\n", a);
        --a;
        printf("             -1 = %u\n", a);
    }

    {
        int a;
        a = INT_MAX;
        printf("int:max         = %d\n", a);
        ++a;
        printf("             +1 = %d\n", a);

        a = INT_MIN;
        printf("int:min         = %d\n", a);
        --a;
        printf("             -1 = %d\n", a);
    }

    {
        unsigned int a;
        a = UINT_MAX;
        printf("uint:max        = %u\n", a);
        ++a;
        printf("             +1 = %u\n", a);

        a = 0;
        printf("uint:min        = %u\n", a);
        --a;
        printf("             -1 = %u\n", a);
    }

    {
        long a;
        a = LONG_MAX;
        printf("long:max        = %ld\n", a);
        ++a;
        printf("             +1 = %ld\n", a);

        a = LONG_MIN;
        printf("long:min        = %ld\n", a);
        --a;
        printf("             -1 = %ld\n", a);
    }

    {
        unsigned long a;
        a = ULONG_MAX;
        printf("ulong:max       = %lu\n", a);
        ++a;
        printf("             +1 = %lu\n", a);

        a = 0;
        printf("ulong:min       = %lu\n", a);
        --a;
        printf("             -1 = %lu\n", a);
    }

    {
        long long a;
        a = LLONG_MAX;
        printf("long long:max   = %lld\n", a);
        ++a;
        printf("             +1 = %lld\n", a);

        a = LLONG_MIN;
        printf("long long:min   = %lld\n", a);
        --a;
        printf("             -1 = %lld\n", a);
    }

    {
        unsigned long long a;
        a = ULLONG_MAX;
        printf("ulong long:max  = %llu\n", a);
        ++a;
        printf("             +1 = %llu\n", a);

        a = 0;
        printf("ulong long:min  = %llu\n", a);
        --a;
        printf("             -1 = %llu\n", a);
    }

    return 0;
}

===32bit===                                                     ===64bit===
short:max       = 32767                                         short:max       = 32767
             +1 = -32768                                                     +1 = -32768
short:min       = -32768                                        short:min       = -32768
             -1 = 32767                                                      -1 = 32767
ushort:max      = 65535                                         ushort:max      = 65535
             +1 = 0                                                          +1 = 0
ushort:min      = 0                                             ushort:min      = 0
             -1 = 65535                                                      -1 = 65535
int:max         = 2147483647                                    int:max         = 2147483647
             +1 = -2147483648                                                +1 = -2147483648
int:min         = -2147483648                                   int:min         = -2147483648
             -1 = 2147483647                                                 -1 = 2147483647
uint:max        = 4294967295                                    uint:max        = 4294967295
             +1 = 0                                                          +1 = 0
uint:min        = 0                                             uint:min        = 0
             -1 = 4294967295                                                 -1 = 4294967295
long:max        = 2147483647                                  | long:max        = 9223372036854775807
             +1 = -2147483648                                 |              +1 = -9223372036854775808
long:min        = -2147483648                                 | long:min        = -9223372036854775808
             -1 = 2147483647                                  |              -1 = 9223372036854775807
ulong:max       = 4294967295                                  | ulong:max       = 18446744073709551615
             +1 = 0                                                          +1 = 0
ulong:min       = 0                                             ulong:min       = 0
             -1 = 4294967295                                  |              -1 = 18446744073709551615
long long:max   = 9223372036854775807                           long long:max   = 9223372036854775807
             +1 = -9223372036854775808                                       +1 = -9223372036854775808
long long:min   = -9223372036854775808                          long long:min   = -9223372036854775808
             -1 = 9223372036854775807                                        -1 = 9223372036854775807
ulong long:max  = 18446744073709551615                          ulong long:max  = 18446744073709551615
             +1 = 0                                                          +1 = 0
ulong long:min  = 0                                             ulong long:min  = 0
             -1 = 18446744073709551615                                       -1 = 18446744073709551615

The only differences are in long / unsigned long: in the 32-bit environment they have the same range as int, and in the 64-bit environment the same range as long long.
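To make the long difference concrete, here is a minimal Python 3 sketch (not part of the original measurements) that asks ctypes for the size of each C type on whatever machine runs it. The helper name signed_limits is made up for illustration, and the sketch assumes a two's-complement platform. Note also that signed overflow is formally undefined behavior in C, so the wraparound shown above is what this gcc build produced rather than something the language guarantees.

```python
import ctypes

# Width of each C integer type on the platform running this script.
# ILP32 (typical 32-bit Linux) gives long = 4 bytes; LP64 (typical 64-bit Linux) gives 8.
for name, ctype in [("int", ctypes.c_int),
                    ("long", ctypes.c_long),
                    ("long long", ctypes.c_longlong)]:
    print("%-10s %d bytes" % (name, ctypes.sizeof(ctype)))


def signed_limits(ctype):
    """Return (min, max) of a signed C integer type, derived from its size."""
    bits = ctypes.sizeof(ctype) * 8
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


long_min, long_max = signed_limits(ctypes.c_long)
print("long:max =", long_max)
# ctypes performs no overflow checking, so the stored value wraps around.
print("      +1 =", ctypes.c_long(long_max + 1).value)
```

On a 64-bit Linux machine this should report 8 bytes for long, and on a 32-bit one 4 bytes, which is the same split as in the diff above.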

C++

#include <cstdlib>   // EXIT_SUCCESS
#include <iostream>
#include <limits>

using namespace std;

int main(int argc, char** argv) {
    {
        short a;
        a = numeric_limits<short>::max();
        cout << "short:max      = " << a << endl;
        ++a;
        cout << "            +1 = " << a << endl;

        a = numeric_limits<short>::min();
        cout << "short:min      = " << a << endl;
        --a;
        cout << "            -1 = " << a << endl;
    }

    {
        unsigned short a;
        a = numeric_limits<unsigned short>::max();
        cout << "ushort:max     = " << a << endl;
        ++a;
        cout << "            +1 = " << a << endl;

        a = numeric_limits<unsigned short>::min();
        cout << "ushort:min     = " << a << endl;
        --a;
        cout << "            -1 = " << a << endl;
    }

    {
        int a;
        a = numeric_limits<int>::max();
        cout << "int:max        = " << a << endl;
        ++a;
        cout << "            +1 = " << a << endl;

        a = numeric_limits<int>::min();
        cout << "int:min        = " << a << endl;
        --a;
        cout << "            -1 = " << a << endl;
    }

    {
        unsigned int a;
        a = numeric_limits<unsigned int>::max();
        cout << "uint:max       = " << a << endl;
        ++a;
        cout << "            +1 = " << a << endl;

        a = numeric_limits<unsigned int>::min();
        cout << "uint:min       = " << a << endl;
        --a;
        cout << "            -1 = " << a << endl;
    }

    {
        long a;
        a = numeric_limits<long>::max();
        cout << "long:max       = " << a << endl;
        ++a;
        cout << "            +1 = " << a << endl;

        a = numeric_limits<long>::min();
        cout << "long:min       = " << a << endl;
        --a;
        cout << "            -1 = " << a << endl;
    }

    {
        unsigned long a;
        a = numeric_limits<unsigned long>::max();
        cout << "ulong:max      = " << a << endl;
        ++a;
        cout << "            +1 = " << a << endl;

        a = numeric_limits<unsigned long>::min();
        cout << "ulong:min      = " << a << endl;
        --a;
        cout << "            -1 = " << a << endl;
    }

    {
        long long a;
        a = numeric_limits<long long>::max();
        cout << "long long:max  = " << a << endl;
        ++a;
        cout << "            +1 = " << a << endl;

        a = numeric_limits<long long>::min();
        cout << "long long:min  = " << a << endl;
        --a;
        cout << "            -1 = " << a << endl;
    }

    {
        unsigned long long a;
        a = numeric_limits<unsigned long long>::max();
        cout << "ulong long:max = " << a << endl;
        ++a;
        cout << "            +1 = " << a << endl;

        a = numeric_limits<unsigned long long>::min();
        cout << "ulong long:min = " << a << endl;
        --a;
        cout << "            -1 = " << a << endl;
    }

    return EXIT_SUCCESS;
}

===32bit===                                                     ===64bit===
short:max      = 32767                                          short:max      = 32767
            +1 = -32768                                                     +1 = -32768
short:min      = -32768                                         short:min      = -32768
            -1 = 32767                                                      -1 = 32767
ushort:max     = 65535                                          ushort:max     = 65535
            +1 = 0                                                          +1 = 0
ushort:min     = 0                                              ushort:min     = 0
            -1 = 65535                                                      -1 = 65535
int:max        = 2147483647                                     int:max        = 2147483647
            +1 = -2147483648                                                +1 = -2147483648
int:min        = -2147483648                                    int:min        = -2147483648
            -1 = 2147483647                                                 -1 = 2147483647
uint:max       = 4294967295                                     uint:max       = 4294967295
            +1 = 0                                                          +1 = 0
uint:min       = 0                                              uint:min       = 0
            -1 = 4294967295                                                 -1 = 4294967295
long:max       = 2147483647                                   | long:max       = 9223372036854775807
            +1 = -2147483648                                  |             +1 = -9223372036854775808
long:min       = -2147483648                                  | long:min       = -9223372036854775808
            -1 = 2147483647                                   |             -1 = 9223372036854775807
ulong:max      = 4294967295                                   | ulong:max      = 18446744073709551615
            +1 = 0                                                          +1 = 0
ulong:min      = 0                                              ulong:min      = 0
            -1 = 4294967295                                   |             -1 = 18446744073709551615
long long:max  = 9223372036854775807                            long long:max  = 9223372036854775807
            +1 = -9223372036854775808                                       +1 = -9223372036854775808
long long:min  = -9223372036854775808                           long long:min  = -9223372036854775808
            -1 = 9223372036854775807                                        -1 = 9223372036854775807
ulong long:max = 18446744073709551615                           ulong long:max = 18446744073709551615
            +1 = 0                                                          +1 = 0
ulong long:min = 0                                              ulong long:min = 0
            -1 = 18446744073709551615                                       -1 = 18446744073709551615

The only differences are in long / unsigned long: in the 32-bit environment they have the same range as int, and in the 64-bit environment the same range as long long.

PHP

<?php

$a = 1;
while (($a << 1) + 1 > $a) {
    $a <<= 1;
    $a += 1;
}
echo "int:max = " . $a . PHP_EOL;
echo "        = ";
var_dump($a);
++$a;
echo "     +1 = " . $a . PHP_EOL;
echo "        = ";
var_dump($a);

$a = -1;
while (($a << 1) < $a) {
    $a <<= 1;
}
echo "int:min = " . $a . PHP_EOL;
echo "        = ";
var_dump($a);
--$a;
echo "     -1 = " . $a . PHP_EOL;
echo "        = ";
var_dump($a);

===32bit===                                                     ===64bit===
int:max = 2147483647                                          | int:max = 9223372036854775807
        = int(2147483647)                                     |         = int(9223372036854775807)
     +1 = 2147483648                                          |      +1 = 9.2233720368548E+18
        = float(2147483648)                                   |         = float(9.2233720368548E+18)
int:min = -2147483648                                         | int:min = -9223372036854775808
        = int(-2147483648)                                    |         = int(-9223372036854775808)
     -1 = -2147483649                                         |      -1 = -9.2233720368548E+18
        = float(-2147483649)                                  |         = float(-9.2233720368548E+18)

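The maximum and minimum are computed here instead of being read from constants. (PHP does provide PHP_INT_MAX, but PHP_INT_MIN was only added in PHP 7.0, so the PHP 5.4 used here has no constant for the minimum.)

What fooled me at first was that, in the 32-bit environment, adding 1 to the maximum and subtracting 1 from the minimum looked at a glance as if the result still fit in an int. var_dump() shows that the value has actually become a float.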
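The reason the 32-bit result looks like an integer while the 64-bit one does not can be reproduced outside PHP. The following Python 3 sketch is an illustration only: it assumes PHP's float is an IEEE 754 double (as Python's is) and that echo uses PHP's default precision setting of 14 significant digits, which "%.14G" imitates.

```python
# Python's float is an IEEE 754 double; the sketch assumes PHP's float is the same.
INT32_MAX = 2**31 - 1
INT64_MAX = 2**63 - 1

# 32-bit case: 2147483648 needs only 32 significant bits, so the double holds it
# exactly and it prints like an integer.
after_32 = float(INT32_MAX) + 1
print("%.14G" % after_32)

# 64-bit case: a double has a 53-bit significand, so 2**63 - 1 is rounded to 2**63
# on conversion, and 14 digits are not enough to print it without an exponent.
after_64 = float(INT64_MAX) + 1
print("%.14G" % after_64)
```

The two printed strings line up with the "+1" rows in the PHP output above, so the apparent int on 32-bit is just a float that happens to be exactly representable.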

Python 2

a = 1
ct = 0
while ct < 128 and (a << 1) + 1 > a:
    a <<= 1
    a += 1
    ct += 1
print "long:max? = ",a
print "          = ",type(a)
a += 1
print "       +1 = ",a
print "          = ",type(a)

a = -1
ct = 0
while ct < 128 and (a << 1) < a:
    a <<= 1
    ct += 1
print "long:min? = ",a
print "          = ",type(a)
a -= 1
print "       -1 = ",a
print "          = ",type(a)

===32bit===                                                     ===64bit===
long:max? =  680564733841876926926749214863536422911            long:max? =  680564733841876926926749214863536422911
          =  <type 'long'>                                                =  <type 'long'>
       +1 =  680564733841876926926749214863536422912                   +1 =  680564733841876926926749214863536422912
          =  <type 'long'>                                                =  <type 'long'>
long:min? =  -340282366920938463463374607431768211456           long:min? =  -340282366920938463463374607431768211456
          =  <type 'long'>                                                =  <type 'long'>
       -1 =  -340282366920938463463374607431768211457                  -1 =  -340282366920938463463374607431768211457
          =  <type 'long'>                                                =  <type 'long'>

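Since I knew Python has no notion of a maximum or minimum integer, the loop is capped at 128 iterations. Adding 1 to or subtracting 1 from the computed values shows that there is still room to spare.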

Python 3

a = 1
ct = 0
while ct < 128 and (a << 1) + 1 > a:
    a <<= 1
    a += 1
    ct += 1
print("int:max? = ",a)
print("         = ",type(a))
a += 1
print("      +1 = ",a)
print("         = ",type(a))

a = -1
ct = 0
while ct < 128 and (a << 1) < a:
    a <<= 1
    ct += 1
print("int:min? = ",a)
print("         = ",type(a))
a -= 1
print("      -1 = ",a)
print("         = ",type(a))

===32bit===                                                     ===64bit===
int:max? =  680564733841876926926749214863536422911             int:max? =  680564733841876926926749214863536422911
         =  <class 'int'>                                                =  <class 'int'>
      +1 =  680564733841876926926749214863536422912                   +1 =  680564733841876926926749214863536422912
         =  <class 'int'>                                                =  <class 'int'>
int:min? =  -340282366920938463463374607431768211456            int:min? =  -340282366920938463463374607431768211456
         =  <class 'int'>                                                =  <class 'int'>
      -1 =  -340282366920938463463374607431768211457                  -1 =  -340282366920938463463374607431768211457
         =  <class 'int'>                                                =  <class 'int'>

The result is the same as for Python 2.
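As a cross-check on the numbers above, here is a small Python 3 sketch of what the capped loop actually computes. Starting from 1 and running 128 iterations of a = (a << 1) + 1 yields a 129-bit value, 2**129 - 1. The comparison with sys.maxsize is added for illustration; it is the largest container index, not a limit on int.

```python
import sys

# The probe loop from the listing, with the 128-iteration cap made explicit.
a = 1
for _ in range(128):
    a = (a << 1) + 1

print(a == (1 << 129) - 1)            # the value is all ones in 129 bits
print(a.bit_length())
print(sys.maxsize)                    # platform word-size limit for indexes only
print(sys.maxsize + 1 > sys.maxsize)  # int itself keeps growing past it
```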

Ruby

a = 1
ct = 0
while ct < 128 && (a << 1) + 1 > a
    a <<= 1
    a += 1
    ct += 1
end
print "int:max? = ",a,"\n"
a += 1
print "      +1 = ",a,"\n"

a = -1
ct = 0
while ct < 128 && (a << 1) < a
    a <<= 1
    ct += 1
end
print "int:min? = ",a,"\n"
a -= 1
print "      -1 = ",a,"\n"

===32bit===                                                     ===64bit===
int:max? = 680564733841876926926749214863536422911              int:max? = 680564733841876926926749214863536422911
      +1 = 680564733841876926926749214863536422912                    +1 = 680564733841876926926749214863536422912
int:min? = -340282366920938463463374607431768211456             int:min? = -340282366920938463463374607431768211456
      -1 = -340282366920938463463374607431768211457                   -1 = -340282366920938463463374607431768211457

I knew that Ruby also has no maximum or minimum, so as with Python the loop is capped at 128 iterations.

Perl

# Reference: http://d.hatena.ne.jp/sardine/20131026
my $a = ~0;
print "int:max = ",$a,"\n";
++$a;
print "     +1 = ",$a,"\n";

$a = -(~0 >> 1) - 1;
print "int:min = ",$a,"\n";
--$a;
print "     -1 = ",$a,"\n";

===32bit===                                                     ===64bit===
int:max = 4294967295                                          | int:max = 18446744073709551615
     +1 = 4294967296                                          |      +1 = 1.84467440737096e+19
int:min = -2147483648                                         | int:min = -9223372036854775808
     -1 = -2147483649                                         |      -1 = -9.22337203685478e+18

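At first I was stuck because the bit operations did not give the results I expected. Without the reference site cited in the source, I would have ended up with strange results.

The 32-bit results look a little odd: +1 / -1 appear to go straight past the maximum / minimum. What is going on here...

(2017/12/08) I worked out what is happening, so here is an addendum.

Following a Stack Overflow question (perl - check if a number is int or float), I added a dump of the variable's internals.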

use Devel::Peek;

# Reference: http://d.hatena.ne.jp/sardine/20131026
# Reference: https://stackoverflow.com/questions/4094036/check-if-a-number-is-int-or-float
my $a = ~0;
print "int:max = ",$a,"\n";
Dump($a);
print STDERR "\n";
++$a;
print "     +1 = ",$a,"\n";
Dump($a);
print STDERR "\n";

$a = -(~0 >> 1) - 1;
print "int:min = ",$a,"\n";
Dump($a);
print STDERR "\n";
--$a;
print "     -1 = ",$a,"\n";
Dump($a);

This produces the following output.

===32bit===                                                     ===64bit===
int:max = 4294967295                                          | int:max = 18446744073709551615
SV = IV(0x8139d84) at 0x8139d88                               | SV = IV(0x106a778) at 0x106a788
  REFCNT = 1                                                      REFCNT = 1
  FLAGS = (PADMY,IOK,pIOK,IsUV)                                   FLAGS = (PADMY,IOK,pIOK,IsUV)
  UV = 4294967295                                             |   UV = 18446744073709551615

     +1 = 4294967296                                          |      +1 = 1.84467440737096e+19
SV = PVNV(0x811e9e0) at 0x8139d88                             | SV = PVNV(0x104cfe0) at 0x106a788
  REFCNT = 1                                                      REFCNT = 1
  FLAGS = (PADMY,NOK,POK,pNOK,pPOK)                               FLAGS = (PADMY,NOK,POK,pNOK,pPOK)
  IV = 0                                                          IV = 0
  NV = 4294967296                                             |   NV = 1.84467440737096e+19
  PV = 0x81417a0 "4294967296"\0                               |   PV = 0x106d530 "1.84467440737096e+19"\0
  CUR = 10                                                    |   CUR = 20
  LEN = 36                                                    |   LEN = 40

int:min = -2147483648                                         | int:min = -9223372036854775808
SV = IV(0x8139e14) at 0x8139e18                               | SV = IV(0x106a940) at 0x106a950
  REFCNT = 1                                                      REFCNT = 1
  FLAGS = (PADMY,IOK,pIOK)                                        FLAGS = (PADMY,IOK,pIOK)
  IV = -2147483648                                            |   IV = -9223372036854775808

     -1 = -2147483649                                         |      -1 = -9.22337203685478e+18
SV = PVNV(0x811e9f4) at 0x8139e18                             | SV = PVNV(0x104d000) at 0x106a950
  REFCNT = 1                                                      REFCNT = 1
  FLAGS = (PADMY,NOK,POK,pNOK,pPOK)                               FLAGS = (PADMY,NOK,POK,pNOK,pPOK)
  IV = -2147483648                                            |   IV = -9223372036854775808
  NV = -2147483649                                            |   NV = -9.22337203685478e+18
  PV = 0x8145810 "-2147483649"\0                              |   PV = 0x106d820 "-9.22337203685478e+18"\0
  CUR = 11                                                    |   CUR = 21
  LEN = 36                                                    |   LEN = 40

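This shows that adding 1 to the maximum or subtracting 1 from the minimum automatically converts the value to a floating-point number (NV): the flags change from IOK to NOK. Mystery solved.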
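The printed values in the dump can be reproduced with ordinary double arithmetic. The Python 3 sketch below is an illustration under one assumption: that Perl stringifies an NV with about 15 significant digits, which "%.15g" imitates. The helper name perl_like is hypothetical.

```python
def perl_like(value):
    """Format a double with 15 significant digits (assumed to mimic Perl's NV output)."""
    return "%.15g" % value


UV_MAX_64 = 2**64 - 1
IV_MIN_64 = -(2**63)
UV_MAX_32 = 2**32 - 1
IV_MIN_32 = -(2**31)

# 64-bit: the results no longer fit in 15 digits, so an exponent appears.
print(perl_like(float(UV_MAX_64) + 1))
print(perl_like(float(IV_MIN_64) - 1))

# 32-bit: the results are small enough for a double to hold exactly,
# so they still look like integers even though they are floating point.
print(perl_like(float(UV_MAX_32) + 1))
print(perl_like(float(IV_MIN_32) - 1))
```

This matches the NV / PV lines in both columns of the dump, which is why only the 32-bit run seemed to go past the limits.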

Go

package main

import (
    "fmt"
)

func main() {
    var a = 0
    var ct = 0

    a = 1
    ct = 0
    for ct < 128 && (a << 1) + 1 > a {
        a <<= 1
        a += 1
        ct += 1
    }
    fmt.Printf("int:max = %d\n", a)
    a += 1
    fmt.Printf("     +1 = %d\n", a)

    a = -1
    ct = 0
    for ct < 128 && (a << 1) < a {
        a <<= 1
        ct += 1
    }
    fmt.Printf("int:min = %d\n", a)
    a -= 1
    fmt.Printf("     -1 = %d\n", a)
}

===32bit===                                                     ===64bit===
int:max = 2147483647                                          | int:max = 9223372036854775807
     +1 = -2147483648                                         |      +1 = -9223372036854775808
int:min = -2147483648                                         | int:min = -9223372036854775808
     -1 = 2147483647                                          |      -1 = 9223372036854775807

I was a little surprised to find that Go's int is affected by 32-bit / 64-bit.
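Go's int is 32 or 64 bits wide depending on the target, and signed overflow in Go wraps around in two's complement rather than being an error. The Python 3 sketch below simulates that wraparound for both widths; wrap_signed is a name invented for this illustration.

```python
def wrap_signed(value, bits):
    """Reduce value to a signed two's-complement integer of the given width."""
    value &= (1 << bits) - 1          # keep only the low `bits` bits
    if value >> (bits - 1):           # sign bit set: interpret as negative
        value -= 1 << bits
    return value


for bits in (32, 64):
    int_max = (1 << (bits - 1)) - 1
    int_min = -(1 << (bits - 1))
    print("%d-bit int:max = %d" % (bits, int_max))
    print("            +1 = %d" % wrap_signed(int_max + 1, bits))
    print("%d-bit int:min = %d" % (bits, int_min))
    print("            -1 = %d" % wrap_signed(int_min - 1, bits))
```

The 32-bit and 64-bit halves of the output correspond to the two columns of the Go result above.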

bash

#! /bin/bash

a=1
while [ $(((a << 1) + 1)) -gt ${a} ]; do
    a=$(((a << 1) + 1))
done
echo "int:max = ${a}"
a=$((a + 1))
echo "     +1 = ${a}"

a=-1
while [ $((a << 1)) -lt ${a} ]; do
    a=$((a << 1))
done
echo "int:min = ${a}"
a=$((a - 1))
echo "     -1 = ${a}"

===32bit===                                                     ===64bit===
int:max = 9223372036854775807                                   int:max = 9223372036854775807
     +1 = -9223372036854775808                                       +1 = -9223372036854775808
int:min = -9223372036854775808                                  int:min = -9223372036854775808
     -1 = 9223372036854775807                                        -1 = 9223372036854775807

Conversely, I had assumed that shell scripts would be affected by 32-bit / 64-bit, so this result was a surprise.

Summary

Here are the results as a table.

		32bit max	32bit min	64bit max	64bit min
Java	byte	127	-128	127	-128
	short	32767	-32768	32767	-32768
	int	2147483647	-2147483648	2147483647	-2147483648
	long	9223372036854775807	-9223372036854775808	9223372036854775807	-9223372036854775808
C	short	32767	-32768	32767	-32768
	unsigned short	65535	0	65535	0
	int	2147483647	-2147483648	2147483647	-2147483648
	unsigned int	4294967295	0	4294967295	0
	long	2147483647	-2147483648	9223372036854775807	-9223372036854775808
	unsigned long	4294967295	0	18446744073709551615	0
	long long	9223372036854775807	-9223372036854775808	9223372036854775807	-9223372036854775808
	unsigned long long	18446744073709551615	0	18446744073709551615	0
C++	short	32767	-32768	32767	-32768
	unsigned short	65535	0	65535	0
	int	2147483647	-2147483648	2147483647	-2147483648
	unsigned int	4294967295	0	4294967295	0
	long	2147483647	-2147483648	9223372036854775807	-9223372036854775808
	unsigned long	4294967295	0	18446744073709551615	0
	long long	9223372036854775807	-9223372036854775808	9223372036854775807	-9223372036854775808
	unsigned long long	18446744073709551615	0	18446744073709551615	0
PHP	int	2147483647	-2147483648	9223372036854775807	-9223372036854775808
Python 2	long	∞	-∞	∞	-∞
Python 3	int	∞	-∞	∞	-∞
Ruby	int	∞	-∞	∞	-∞
Perl	int	4294967295	-2147483648	18446744073709551615	-9223372036854775808
Go	int	2147483647	-2147483648	9223372036854775807	-9223372036854775808
bash	int	9223372036854775807	-9223372036854775808	9223372036854775807	-9223372036854775808

2017-12-05

Trying string replacement and string splitting on UTF-8 byte sequences in various languages

Java C C++ PHP Python Ruby Perl Go bash

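Notes from reading a UTF-8 byte sequence in each language and performing string replacement and string splitting on it.

The requirements are as follows.

A single line of text is given on standard input.
- The character encoding is UTF-8
- The input is at most 10 characters long
Print the following two items to standard output, separated by a newline.
- The string with every character replaced by '.'
- The input string with a newline after each character
In other words, a 10-character input produces 11 lines of output.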
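As a worked example of these requirements, here is a minimal Python 3 sketch. It is not one of the original test programs: the function name replace_and_split and the sample string are made up for illustration. The sample deliberately includes a character outside the BMP, and the last lines show what goes wrong if the same pattern is applied to the undecoded bytes.

```python
import re


def replace_and_split(line):
    """Return the required output lines for one already-decoded input line."""
    dots = re.sub(r".", ".", line)  # one '.' per character (code point)
    return [dots] + list(line)      # then each character on its own line


# 'a', HIRAGANA LETTER A, and U+20BB7 (outside the BMP), written as escapes
sample = "a\u3042\U00020BB7"
lines = replace_and_split(sample)
print(lines[0], len(lines))

# The same pattern applied to the raw UTF-8 bytes matches byte by byte instead.
byte_dots = re.sub(rb".", b".", sample.encode("utf-8"))
print(byte_dots)
```

A 3-character input gives 4 output lines, while the byte-level version produces 8 dots (1 + 3 + 4 bytes), which is the kind of difference the programs below are probing.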

Environment

Since I am using what I have on hand, the environment is limited to the following.

CentOS 7
- Java (openjdk version "1.8.0_151")
- C (gcc (GCC) 4.8.5)
  - compiled with -std=gnu11
- C++ (g++ (GCC) 4.8.5)
  - compiled with -std=gnu++1y
- PHP (PHP 5.4.16 (cli))
- Python 2 (Python 2.7.5)
- Python 3 (Python 3.6.3)
  - built from source
- Ruby (ruby 2.0.0p648)
- Perl (v5.16.3)
- Go (go version go1.8.3 linux/amd64)
- bash (4.2.46(1)-release)

Java

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;

public class Main {
    public static void main(String[] args) {
        try (BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
            PrintWriter out = new PrintWriter(System.out)
        ) {
            String s = in.readLine();

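            // String replacement
            out.println(s.replaceAll(".", "."));

            // String splitting
            for (int i = 0; i < s.length(); ++i) {
                char ch1 = s.charAt(i);
                out.print(ch1);
                // A high surrogate is followed by its low surrogate; keep the pair on one line.
                if (Character.isHighSurrogate(ch1) && i + 1 < s.length()) {
                    ++i;
                    char ch2 = s.charAt(i);
                    out.print(ch2);
                }
                out.println();
            }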
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

C

#include <stdio.h>
#include <string.h>
#include <locale.h>
#include <stdlib.h>
#include <regex.h>

int main(int argc, char** argv) {
    setlocale(LC_ALL, "ja_JP.UTF-8");

    char str[1024];

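    if (fgets(str, sizeof(str), stdin) == NULL) {
        return 1;
    }
    // Strip trailing newline characters; the length check avoids reading str[-1] on empty input.
    while (strlen(str) > 0 && (str[strlen(str) - 1] == '\n' || str[strlen(str) - 1] == '\r')) {
        str[strlen(str) - 1] = '\0';
    }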

    regex_t rb;
    if (regcomp(&rb, ".", REG_EXTENDED | REG_NEWLINE)) {
        perror("regcomp");
        return 1;
    }

    const char* p;
    regmatch_t rm;
    int err;
    int idx;

    // String replacement
    p = str;
    idx = 0;
    do {
        err = regexec(&rb, p + idx, 1, &rm, 0);
        if (!err) {
            if (rm.rm_so > 0) {
                char buf[1024];
                memset(buf, '\0', sizeof(buf));
                strncpy(buf, p + idx, rm.rm_so);
                fprintf(stdout, "%s", buf);
            }
            fprintf(stdout, ".");
            idx += rm.rm_eo;
        }
    } while (!err);
    fprintf(stdout, "%s\n", p + idx);

    // String splitting
    p = str;
    idx = 0;
    do {
        err = regexec(&rb, p + idx, 1, &rm, 0);
        if (!err) {
            char buf[1024];
            memset(buf, '\0', sizeof(buf));
            strncpy(buf, p + idx + rm.rm_so, rm.rm_eo - rm.rm_so);
            fprintf(stdout, "%s\n", buf);
            idx += rm.rm_eo;
        }
    } while (!err);

    regfree(&rb);

    return 0;
}

C++

#include <clocale>   // setlocale
#include <cstdlib>   // EXIT_SUCCESS
#include <iostream>
#include <locale>
#include <string>

using namespace std;

int main(int argc, char** argv) {
    setlocale(LC_ALL, "ja_JP.UTF-8");
    wcout.imbue(locale("japanese"));

    wstring str;
    getline(wcin, str);

    // String replacement (workaround)
    //   regex_match only supports whole-string matches, so it cannot be used here.
    for (int i = 0; i < str.length(); ++i) {
        wcout << L".";
    }
    wcout << endl;

    // String splitting (workaround)
    //   regex_match only supports whole-string matches, so it cannot be used here.
    for (int i = 0; i < str.length(); ++i) {
        wcout << str[i] << endl;
    }

    return EXIT_SUCCESS;
}

(Added 2017/12/05)

After running "yum install boost-devel" and switching to the Boost library, it worked properly, so here is that source code as well. "-lboost_regex" is required when compiling.

#include <iostream>
#include <locale>
#include <string>
#include <boost/regex.hpp>

using namespace std;

int main(int argc, char** argv) {
    setlocale(LC_ALL, "ja_JP.UTF-8");
    wcout.imbue(locale("japanese"));

    wstring str;
    getline(wcin, str);

    boost::wregex re(L".");

    // String replacement
    wcout << boost::regex_replace(str, re, L".") << endl;

    // String splitting
    boost::wsmatch sm;
    wstring::const_iterator start = str.begin();
    wstring::const_iterator end = str.end();
    int offset = 0;
    while (boost::regex_search(start + offset, end, sm, re)) {
        size_t idx = 0;
        for (int i = 0; i < sm.length(idx); ++i) {
            wcout << str[sm.position(idx) + offset + i];
        }
        wcout << endl;
        offset += sm.position(idx) + sm.length(idx);
    }

    return EXIT_SUCCESS;
}

PHP

<?php

$str = file_get_contents('php://stdin');
$str = trim($str);

mb_regex_encoding('UTF-8');

// String replacement
//   The mb_* functions only come in an ereg flavor
//   Note that the pattern syntax differs from the preg functions..
echo mb_ereg_replace('.', '.', $str) . PHP_EOL;

// String splitting
//   The mb_* functions only come in an ereg flavor
//   Note that the pattern syntax differs from the preg functions..
$tmp = $str;
do {
    mb_ereg_search_init($tmp, '.');
    $range = mb_ereg_search_pos();
    if ($range !== false) {
        echo substr($tmp, $range[0], $range[1]) . PHP_EOL;
        $tmp = substr($tmp, $range[1]);
    }
} while ($tmp !== false && $range !== false);

Python 2

# -*- coding: UTF-8 -*-
import sys
import re

s = sys.stdin.readline()
ustr = unicode(s, 'UTF-8')
ustr = ustr.replace('\n', '')
ustr = ustr.replace('\r', '')

# String replacement
print re.sub(r'.', '.', ustr)

# String splitting
for i in range(0, len(ustr)):
    print ustr[i].encode('UTF-8')

Python 3

# -*- coding: UTF-8 -*-
import sys
import re

b = sys.stdin.buffer.readline()
s = str(b, 'UTF-8')
s = s.replace('\n', '')
s = s.replace('\r', '')

# String replacement
print(re.sub(r'.', '.', s))

# String splitting
for i in range(0, len(s)):
    print(s[i])

Ruby

str = STDIN.gets
str.chomp!()

# String replacement
print str.gsub(/./, '.'),"\n"

# String splitting
for i in 0...str.size()
    print str[i],"\n"
end

Perl

use Encode;

my $str = readline(STDIN);
chomp($str);

# String replacement
my $ustr = decode('UTF-8', $str);
my $tmp = $ustr;
$tmp =~ s/././g;
print $tmp,"\n";

# String splitting
$tmp = $ustr;
for (my $i = 0; $i < length($ustr); ++$i) {
    print encode('UTF-8', substr($tmp, $i, 1)),"\n";
}

Go

package main

import (
    "fmt"
    "os"
    "io"
    "bufio"
    "regexp"
)

func ReadLine(reader *bufio.Reader) (s string, err error) {
    prefix := false
    buf := make([]byte, 0)
    var line []byte
    for {
        line, prefix, err = reader.ReadLine()
        if err == io.EOF {
            return
        }
        buf = append(buf, line...)
        if prefix {
            continue
        }
        s = string(buf)
        return
    }
}

func main() {
    stdin := bufio.NewReader(os.Stdin)
    s, _ := ReadLine(stdin)

    ss := regexp.MustCompile(`.`).ReplaceAllString(s, ".")
    fmt.Println(ss)

    runes := []rune(s)
    for i := 0; i < len(runes); i += 1 {
        fmt.Println(string(runes[i]))
    }
}

bash

#! /bin/bash

IFS= read s

echo "${s}" | sed -e 's/././g'

echo -n "${s}" | sed -e 's/\(.\)/\1\n/g'

2017-12-04

Getting the byte count and character count of a UTF-8 byte sequence in various languages

Java C C++ PHP Python Ruby Perl Go bash

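Notes from reading a UTF-8 byte sequence in each language and obtaining its byte count and its length in Unicode characters.

The requirements are as follows.

A single line of text is given on standard input.
- The character encoding is UTF-8
- The input is at most 10 characters long
Print the following three items to standard output, separated by newlines.
- The total number of bytes in the string
- Its length in characters
- The input string itself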

Environment

Since I am using what I have on hand, the environment is limited to the following.

CentOS 7
- Java (openjdk version "1.8.0_151")
- C (gcc (GCC) 4.8.5)
  - compiled with -std=gnu11
- C++ (g++ (GCC) 4.8.5)
  - compiled with -std=gnu++1y
- PHP (PHP 5.4.16 (cli))
- Python 2 (Python 2.7.5)
- Python 3 (Python 3.6.3)
  - built from source
- Ruby (ruby 2.0.0p648)
- Perl (v5.16.3)
- Go (go version go1.8.3 linux/amd64)
- bash (4.2.46(1)-release)

Java

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;

public class Main {
    public static void main(String[] args) {
        try (BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
            PrintWriter out = new PrintWriter(System.out)
        ) {
            String s = in.readLine();
            out.println(s.getBytes().length);
            out.println(s.codePointCount(0, s.length()));
            out.println(s);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

For historical reasons, a Java char cannot hold a character that needs a surrogate pair, so String.length() must not be used when you want the number of characters.
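The three different "lengths" involved can be shown side by side. The Python 3 sketch below is only an illustration (the function name lengths is invented): the UTF-16 code unit count is what Java's String.length() reports, and the code point count is what codePointCount() reports.

```python
def lengths(text):
    """Return (UTF-8 bytes, UTF-16 code units, Unicode code points) for text."""
    utf8_bytes = len(text.encode("utf-8"))
    utf16_units = len(text.encode("utf-16-le")) // 2  # 2 bytes per code unit, no BOM
    code_points = len(text)                           # Python 3 str counts code points
    return utf8_bytes, utf16_units, code_points


print(lengths("abc"))         # ASCII only: all three agree
print(lengths("\u3042"))      # HIRAGANA LETTER A: 3 bytes, 1 unit, 1 code point
print(lengths("\U00020BB7"))  # outside the BMP: 4 bytes, 2 units, 1 code point
```

Only the last case makes String.length() and the real character count disagree, which is why the Java program above calls codePointCount().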

C

#include <stdio.h>
#include <string.h>
#include <locale.h>
#include <wchar.h>
#include <stdlib.h>

int main(int argc, char** argv) {
    setlocale(LC_ALL, "ja_JP.UTF-8");

    char str[1024];

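    if (fgets(str, sizeof(str), stdin) == NULL) {
        return 1;
    }
    while (strlen(str) > 0 && (str[strlen(str) - 1] == '\n' || str[strlen(str) - 1] == '\r')) {
        str[strlen(str) - 1] = '\0';
    }

    // strlen() returns size_t, so print it with %zu
    fprintf(stdout, "%zu\n", strlen(str));

    wchar_t buf[1024];
    const char* p = str;
    // The third argument is a count of wide characters, not bytes
    mbsrtowcs(buf, &p, sizeof(buf) / sizeof(buf[0]), NULL);
    fprintf(stdout, "%zu\n", wcslen(buf));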

    fprintf(stdout, "%s\n", str);

    return 0;
}

C++

#include <iostream>
#include <cwchar>
#include <clocale>
#include <cstdlib>   // wcstombs, EXIT_SUCCESS
#include <string>
#include <cstring>

using namespace std;

int main(int argc, char** argv) {
    setlocale(LC_ALL, "ja_JP.UTF-8");

    wstring str;
    getline(wcin, str);

    char cbuf[1024];
    wcstombs(cbuf, str.c_str(), sizeof(cbuf));
    wcout << strlen(cbuf) << endl;

    wcout << str.length() << endl;

    wcout << str << endl;

    return EXIT_SUCCESS;
}

PHP

<?php

$str = file_get_contents('php://stdin');
$str = trim($str);

echo strlen($str) . PHP_EOL;

echo mb_strlen($str, 'UTF-8') . PHP_EOL;

echo $str . PHP_EOL;

Python 2

import sys

s = sys.stdin.readline()
s = s.replace('\n', '')
s = s.replace('\r', '')

print len(s)

ustr = unicode(s, 'UTF-8')
print len(ustr)

print s

Python 3

import sys

b = sys.stdin.buffer.readline()
s = str(b, 'UTF-8')
s = s.replace('\n', '')
s = s.replace('\r', '')
b = bytes(s, 'UTF-8')

print(len(b))

print(len(s))

print(s)

Ruby

str = STDIN.gets
str.chomp!()

print str.bytes().size(),"\n"

print str.size(),"\n"

print str,"\n"

Perl

use Encode;

my $str = readline(STDIN);
chomp($str);

print length($str),"\n";

my $b = $str;
$b = decode('UTF-8', $b);
print length($b),"\n";

print $str,"\n";

Go

package main

import (
    "fmt"
    "os"
    "io"
    "bufio"
)

func ReadLine(reader *bufio.Reader) (s string, err error) {
    prefix := false
    buf := make([]byte, 0)
    var line []byte
    for {
        line, prefix, err = reader.ReadLine()
        if err == io.EOF {
            return
        }
        buf = append(buf, line...)
        if prefix {
            continue
        }
        s = string(buf)
        return
    }
}

func main() {
    stdin := bufio.NewReader(os.Stdin)
    s, _ := ReadLine(stdin)

    fmt.Println(len(s))

    runes := []rune(s)
    fmt.Println(len(runes))

    fmt.Println(s)
}

bash

#! /bin/bash

IFS= read s

echo -n "${s}" | wc -c

echo ${#s}

echo "${s}"

I really did not expect a shell script to get the character count right for a string containing characters that need surrogate pairs.

2017-12-03

File I/O plus character encoding conversion in various languages

Java C C++ PHP Python Ruby Perl Go bash

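Notes from writing file I/O and character encoding conversion in each language.

Partway through I thought it might have been better to split these into separate entries, but in Java, for example, "characters" are UTF-16 internally, so I/O and encoding conversion are closely intertwined. I decided to leave it as one entry.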

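The requirements are as follows.

A directory path is passed as a command-line argument
- The directory contains an EUC-JP file named "in.txt"
Read "in.txt", convert its character encoding to Shift_JIS, and write the result as "out.txt" in the same directory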
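The requirement can be sketched in a few lines of Python 3. This is an illustrative version, not one of the original programs: the function name convert_dir is made up, and it uses the cp932 codec on the assumption that it corresponds to the "Windows-31J" charset the Java code names for Shift_JIS output.

```python
import os


def convert_dir(dirname):
    """Read dirname/in.txt as EUC-JP and write dirname/out.txt as Shift_JIS (cp932)."""
    in_path = os.path.join(dirname, "in.txt")
    out_path = os.path.join(dirname, "out.txt")
    # newline="" passes line endings through unchanged in both directions.
    with open(in_path, "r", encoding="euc_jp", newline="") as src, \
         open(out_path, "w", encoding="cp932", newline="") as dst:
        # Copy decoded text in chunks; read() returns "" at end of file.
        for chunk in iter(lambda: src.read(1024), ""):
            dst.write(chunk)
```

Calling convert_dir("/path/to/dir") on a directory that holds in.txt would leave out.txt next to it. The decoding and encoding happen inside the file objects, which is the same division of labor as the Reader / Writer approach in Java.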

Environment

Since I am using what I have on hand, the environment is limited to the following.

CentOS 7
- Java (openjdk version "1.8.0_151")
- C (gcc (GCC) 4.8.5)
  - compiled with -std=gnu11
- C++ (g++ (GCC) 4.8.5)
  - compiled with -std=gnu++1y
- PHP (PHP 5.4.16 (cli))
- Python 2 (Python 2.7.5)
- Python 3 (Python 3.6.3)
  - built from source
- Ruby (ruby 2.0.0p648)
- Perl (v5.16.3)
- Go (go version go1.8.3 linux/amd64)
- bash (4.2.46(1)-release)

Java

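Having only just learned that CharsetDecoder / CharsetEncoder exist in the java.nio.charset package, I wrote two patterns.

Pattern 1
- Convert the character encoding with InputStreamReader when reading the file
- Convert the character encoding with OutputStreamWriter when writing the file
Pattern 2
- Convert the bytes that were read into a string with CharsetDecoder
- Convert the string into bytes with CharsetEncoder and write them out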
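The tricky part of pattern 2 is that a chunk of bytes can end in the middle of a multibyte character, so the decoder has to carry state between calls. The following Python 3 sketch shows the same idea with the standard codecs module. It is an analogy rather than a translation of the Java code, and the function name transcode_chunks is invented for illustration.

```python
import codecs


def transcode_chunks(chunks, src="euc_jp", dst="cp932"):
    """Yield dst-encoded bytes for an iterable of src-encoded byte chunks."""
    decoder = codecs.getincrementaldecoder(src)()
    encoder = codecs.getincrementalencoder(dst)()
    for chunk in chunks:
        # An incomplete trailing sequence stays buffered in the decoder
        # until the next chunk supplies the rest of the character.
        yield encoder.encode(decoder.decode(chunk))
    # final=True tells both sides that no more input is coming.
    yield encoder.encode(decoder.decode(b"", final=True), final=True)


text = "\u65e5\u672c\u8a9e"           # three kanji, 2 bytes each in EUC-JP
data = text.encode("euc_jp")
pieces = [data[:3], data[3:]]          # split in the middle of the second character
result = b"".join(transcode_chunks(pieces))
print(result == text.encode("cp932"))
```

Pattern 1 hides this bookkeeping inside the reader and writer objects, whereas pattern 2 exposes it, which is why the second Java listing is so much longer.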

Pattern 1

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;

public class Main {
    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("Usage: java Main dirname");
            System.exit(1);
            return;
        }

        File inFile = new File(args[0], "in.txt");
        File outFile = new File(args[0], "out.txt");

        try (BufferedReader in = new BufferedReader(new InputStreamReader(new FileInputStream(inFile), "EUC-JP"));
            PrintWriter out = new PrintWriter(new OutputStreamWriter(new FileOutputStream(outFile), "Windows-31J"))
        ) {
            char[] buf = new char[1024];
            int len;
            while ((len = in.read(buf)) > 0) {
                out.write(buf, 0, len);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Pattern 2

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.PrintStream;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;

public class Main {
    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("Usage: java Main dirname");
            System.exit(1);
            return;
        }

        File inFile = new File(args[0], "in.txt");
        File outFile = new File(args[0], "out.txt");

        CharsetDecoder decoder = Charset.forName("EUC-JP").newDecoder();
        decoder.reset();
        CharsetEncoder encoder = Charset.forName("Windows-31J").newEncoder();
        encoder.reset();

        try (InputStream in = new BufferedInputStream(new FileInputStream(inFile));
            PrintStream out = new PrintStream(new FileOutputStream(outFile))
        ) {
            ByteBuffer inBuf = ByteBuffer.allocate(1024);
            ByteBuffer outBuf = ByteBuffer.allocate(1024);
            CharBuffer tmpBuf = CharBuffer.allocate(1024);
            byte[] buf = new byte[1024];
            int len;
            while ((len = in.read(buf, 0, Math.min(buf.length, inBuf.remaining()))) > 0) {
                inBuf.put(buf, 0, len);
                inBuf.flip();
                tmpBuf.clear();
                CoderResult res = decoder.decode(inBuf, tmpBuf, false);
                if (res.isUnderflow()) {
                    inBuf.compact();
                    tmpBuf.flip();
                    outBuf.clear();
                    encoder.encode(tmpBuf, outBuf, false);
                    outBuf.flip();
                    outBuf.get(buf, 0, outBuf.limit());
                    out.write(buf, 0, outBuf.limit());
                }
            }
            /* Dummy pass with endOfInput=true, needed before flush() can be called */
            inBuf.clear();
            inBuf.flip();
            tmpBuf.clear();
            decoder.decode(inBuf, tmpBuf, true);
            tmpBuf.flip();
            outBuf.clear();
            encoder.encode(tmpBuf, outBuf, true);
            outBuf.flip();
            outBuf.get(buf, 0, outBuf.limit());
            out.write(buf, 0, outBuf.limit());

            /* flush() */
            tmpBuf.clear();
            decoder.flush(tmpBuf);
            tmpBuf.flip();
            outBuf.clear();
            encoder.flush(outBuf);
            outBuf.flip();
            outBuf.get(buf, 0, outBuf.limit());
            out.write(buf, 0, outBuf.limit());
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

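It took me a very long time to understand how CharsetDecoder / CharsetEncoder behave; to be honest, I did not really understand how ByteBuffer / CharBuffer behave either. Even now, I am not at all confident that the program above is actually correct.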
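
The case that is easiest to get wrong in Pattern 2 is a multibyte character that is split across two reads, which is what the compact() call is there to carry over. As a cross-check of the expected behavior, here is a minimal sketch of the same chunked conversion using Python 3's incremental codecs. It is not a port of the Java program; the function name convert_chunks and the chunk boundaries are made up for illustration.

```python
import codecs


def convert_chunks(chunks, src="euc_jp", dst="cp932"):
    """Convert an iterable of byte chunks from src to dst encoding.

    The incremental decoder keeps the bytes of a multibyte sequence that
    is cut off at the end of a chunk and completes it with the next chunk.
    """
    decoder = codecs.getincrementaldecoder(src)()
    encoder = codecs.getincrementalencoder(dst)()
    out = bytearray()
    for chunk in chunks:
        out += encoder.encode(decoder.decode(chunk))
    # final=True makes the decoder raise if a sequence is still incomplete
    out += encoder.encode(decoder.decode(b"", final=True), final=True)
    return bytes(out)


data = "あいう".encode("euc_jp")  # each character is 2 bytes in EUC-JP
# Hypothetical chunk boundaries: the first chunk ends mid-character
chunks = [data[:1], data[1:3], data[3:]]
result = convert_chunks(chunks)
```

Feeding the Java program an in.txt whose multibyte characters straddle the 1024-byte read boundary, and comparing out.txt against a conversion like this one, would be one way to gain some confidence in it.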

C

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <iconv.h>
#include <limits.h>
#include <errno.h>

#define BUF_SIZE 1024

int main(int argc, char** argv) {
    if (argc != 2) {
        fprintf(stderr, "Usage: %s dirname\n", argv[0]);
        return 1;
    }

    /* Open the files */
    char inFile[PATH_MAX];
    snprintf(inFile, sizeof(inFile), "%s/in.txt", argv[1]);
    char outFile[PATH_MAX];
    snprintf(outFile, sizeof(outFile), "%s/out.txt", argv[1]);

    FILE* inFp = fopen(inFile, "r");
    if (!inFp) {
        perror(inFile);
        return 1;
    }
    FILE* outFp = fopen(outFile, "w");
    if (!outFp) {
        perror(outFile);    
        fclose(inFp);
        return 1;
    }

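    /* Prepare the character encoding conversion */
    iconv_t iconvHandler = iconv_open("CP932", "EUC-JP");
    if (iconvHandler == (iconv_t)-1) {
        perror("iconv_open");
        fclose(inFp);
        fclose(outFp);
        return 1;
    }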

    /* Read the input, convert its character encoding, and write the output */
    char inBuf[BUF_SIZE];
    size_t inBufLeft = 0;
    char outBuf[BUF_SIZE];
    size_t len;
    while ((len = fread(inBuf + inBufLeft, sizeof(char), sizeof(inBuf) - inBufLeft, inFp)) + inBufLeft > 0) {
        inBufLeft += len;
        char* inPtr = inBuf;
        char* outPtr = outBuf;
        size_t outBufLeft = sizeof(outBuf);

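        size_t rc = iconv(iconvHandler, &inPtr, &inBufLeft, &outPtr, &outBufLeft);
        /* EINVAL (incomplete sequence at the end of the buffer) is normally resolved by
           reading more input, but at end of file (len == 0) it would loop forever */
        if (rc == (size_t)-1 && (errno == EILSEQ || errno == E2BIG || (errno == EINVAL && len == 0))) {
            perror("iconv");
            break;
        }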
        fwrite(outBuf, sizeof(char), sizeof(outBuf) - outBufLeft, outFp);

        if (inBufLeft > 0) {
            /* Move the unconverted bytes to the front; the regions may overlap, so use memmove */
            memmove(inBuf, inPtr, inBufLeft);
        }
    }
    iconv_close(iconvHandler);

    /* Close the files */
    fclose(inFp);
    fclose(outFp);

    return 0;
}

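I had a very hard time understanding how iconv behaves. I tried several samples found on the net, but as soon as I tweaked the input data a little they failed my test cases and turned out to be unusable, so in the end I understood it by reading the man page closely...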

Man page of ICONV
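
The kind of tweaked input that tends to break naive iconv loops is a multibyte character that starts on the last byte of the read buffer, so that iconv() stops with EINVAL and leaves unconverted bytes behind. The test data actually used is not shown here, so the following is only a sketch of how such a case could be generated in Python 3; the function name make_boundary_case and the sample text are assumptions, and buf_size is meant to match BUF_SIZE in the C program.

```python
def make_boundary_case(buf_size=1024, text="日本語"):
    """Return (euc_jp_input, expected_cp932_output).

    The first multibyte character starts on the last byte of the first
    buf_size-byte read, so it straddles the buffer boundary.
    """
    padding = "a" * (buf_size - 1)  # ASCII filler, one byte per character
    s = padding + text + "\n"
    return s.encode("euc_jp"), s.encode("cp932")


def write_case(in_path, buf_size=1024):
    """Write the generated input to in_path and return the expected output."""
    src, expected = make_boundary_case(buf_size)
    with open(in_path, "wb") as f:
        f.write(src)
    return expected


src, expected = make_boundary_case()
```

Writing src to in.txt, running the converter, and comparing out.txt with expected byte for byte exercises the leftover-bytes path in the loop.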

C++

This is the C program with the file I/O part rewritten in C++.

#include <fstream>
#include <iostream>
#include <string>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <iconv.h>
#include <limits.h>
#include <errno.h>

#define BUF_SIZE 1024

using namespace std;

int main(int argc, char** argv) {
    if (argc != 2) {
        fprintf(stderr, "Usage: %s dirname\n", argv[0]);
        return EXIT_FAILURE;
    }

    /* Open the files */
    char inFile[PATH_MAX];
    snprintf(inFile, sizeof(inFile), "%s/in.txt", argv[1]);
    char outFile[PATH_MAX];
    snprintf(outFile, sizeof(outFile), "%s/out.txt", argv[1]);

    ifstream inFs(inFile, ios::binary);
    if (inFs.fail()) {
        perror(inFile);
        return EXIT_FAILURE;
    }
    ofstream outFs(outFile, ios::binary);
    if (outFs.fail()) {
        perror(outFile);    
        return EXIT_FAILURE;
    }

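    /* Prepare the character encoding conversion */
    iconv_t iconvHandler = iconv_open("CP932", "EUC-JP");
    if (iconvHandler == (iconv_t)-1) {
        perror("iconv_open");
        return EXIT_FAILURE;
    }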

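    /* Read the input, convert its character encoding, and write the output */
    char inBuf[BUF_SIZE / 2];
    size_t inBufLeft = 0;
    char outBuf[BUF_SIZE * 2];
    streamsize len;
    /* read() + gcount() is used because how much readsome() returns is implementation-dependent */
    while ((len = inFs.read(inBuf + inBufLeft, sizeof(inBuf) - inBufLeft).gcount()) + inBufLeft > 0) {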
        inBufLeft += len;
        char* inPtr = inBuf;
        char* outPtr = outBuf;
        size_t outBufLeft = sizeof(outBuf);

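        size_t rc = iconv(iconvHandler, &inPtr, &inBufLeft, &outPtr, &outBufLeft);
        /* EINVAL (incomplete sequence at the end of the buffer) is normally resolved by
           reading more input, but at end of file (len == 0) it would loop forever */
        if (rc == (size_t)-1 && (errno == EILSEQ || errno == E2BIG || (errno == EINVAL && len == 0))) {
            perror("iconv");
            break;
        }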
        outFs.write(outBuf, sizeof(outBuf) - outBufLeft);

        if (inBufLeft > 0) {
            /* Move the unconverted bytes to the front; the regions may overlap, so use memmove */
            memmove(inBuf, inPtr, inBufLeft);
        }
    }
    iconv_close(iconvHandler);

    return EXIT_SUCCESS;
}

PHP

I thought of two variants, so I wrote both.

Pattern 1
- File I/O uses fopen / fgets / fputs / fclose
- Character encoding conversion uses mb_convert_encoding
Pattern 2
- File I/O uses file_get_contents / file_put_contents
- Character encoding conversion uses iconv

Pattern 1

<?php

if (count($argv) < 2) {
    file_put_contents('php://stderr', "Usage: php {$argv[0]} dirname" . PHP_EOL);
    exit(1);
}
$dir = $argv[1];
if (!file_exists($dir) || !is_dir($dir)) {
    file_put_contents('php://stderr', "{$dir}: No such directory." . PHP_EOL);
    exit(1);
}

$inFile = "{$dir}/in.txt";
$outFile = "{$dir}/out.txt";

$inFp = fopen($inFile, 'rb');
if (!$inFp) {
    file_put_contents('php://stderr', "{$inFile}: Cannot open file." . PHP_EOL);
    exit(1);
}
$outFp = fopen($outFile, 'wb');
if (!$outFp) {
    file_put_contents('php://stderr', "{$outFile}: Cannot open file." . PHP_EOL);
    fclose($inFp);
    exit(1);
}

while (($line = fgets($inFp)) !== false) {
    $line = mb_convert_encoding($line, 'SJIS-win', 'eucJP-win');
    fputs($outFp, $line);
}

fclose($inFp);
fclose($outFp);

Pattern 2

<?php

if (count($argv) < 2) {
    file_put_contents('php://stderr', "Usage: php {$argv[0]} dirname" . PHP_EOL);
    exit(1);
}

$dir = $argv[1];
if (!is_dir($dir)) {
    file_put_contents('php://stderr', "{$dir}: No such directory." . PHP_EOL);
    exit(1);
}

$inFile = "{$dir}/in.txt";
$outFile = "{$dir}/out.txt";
if (!file_exists($inFile)) {
    file_put_contents('php://stderr', "{$inFile}: No such file." . PHP_EOL);
    exit(1);
}

$str = file_get_contents($inFile);

$str = iconv('eucJP-win', 'SJIS-win', $str);

file_put_contents($outFile, $str);

Python 2

import sys
import os
import codecs

if len(sys.argv) < 2:
    sys.stderr.write("Usage: " + sys.argv[0] + " dirname\n")
    exit(1)

dirname = sys.argv[1]

inFile = dirname + "/in.txt"
outFile = dirname + "/out.txt"
if not os.path.isfile(inFile):
    sys.stderr.write(inFile + ": No such file\n")
    exit(1)

inFp = codecs.open(inFile, 'rb', 'EUC-JP')
outFp = codecs.open(outFile, 'wb', 'CP932')

for line in inFp:
    outFp.write(line)

inFp.close()
outFp.close()

Python 3

import sys
import os
import codecs

if len(sys.argv) < 2:
    sys.stderr.write("Usage: " + sys.argv[0] + " dirname\n")
    exit(1)

dirname = sys.argv[1]

inFile = dirname + "/in.txt"
outFile = dirname + "/out.txt"
if not os.path.isfile(inFile):
    sys.stderr.write(inFile + ": No such file\n")
    exit(1)

inFp = codecs.open(inFile, 'rb', 'EUC-JP')
outFp = codecs.open(outFile, 'wb', 'CP932')

for line in inFp:
    outFp.write(line)

inFp.close()
outFp.close()

This is no different from the Python 2 version.
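
Since the Python 3 version is identical, it may be worth noting that Python 3's built-in open() also accepts an encoding argument, so the codecs module is not strictly needed there. A minimal sketch, assuming the same in.txt / out.txt layout as above; the function name convert is hypothetical:

```python
import os


def convert(dirname, src="euc_jp", dst="cp932"):
    """Convert dirname/in.txt from src encoding into dirname/out.txt in dst."""
    in_path = os.path.join(dirname, "in.txt")
    out_path = os.path.join(dirname, "out.txt")
    # newline="" turns off newline translation, so line endings are
    # passed through as they appear in the input
    with open(in_path, encoding=src, newline="") as fin, \
         open(out_path, "w", encoding=dst, newline="") as fout:
        for line in fin:
            fout.write(line)
```

Calling convert(sys.argv[1]) after the same argument check would give a command-line program equivalent to the one above. The with statement also closes both files if the conversion raises partway through.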

Ruby

if ARGV.length != 1
    STDERR.puts('Usage: ' + __FILE__ + ' dirname')
    exit 1
end

dir = ARGV[0]

inFile = dir + '/in.txt'
outFile = dir + '/out.txt'
if !File.file?(inFile)
    STDERR.puts(inFile + ': No such file')
    exit 1
end

inFp = File.open(inFile, 'rb')
outFp = File.open(outFile, 'wb')

inFp.each_line{|line|
    line.encode!('CP932', 'EUC-JP')
    # write, not puts: puts would append a newline to a final line that lacks one
    outFp.write(line)
}

inFp.close()
outFp.close()

Perl

Pattern 1
- Specify the character encoding when opening the file
Pattern 2
- Convert the character encoding of the strings read with encode / decode

Pattern 1

if (@ARGV < 1) {
    die("Usage: " . __FILE__ . " dirname");
}

my $dir = $ARGV[0];
my $inFile = $dir . "/in.txt";
my $outFile = $dir . "/out.txt";

open(inFp, "<:encoding(EUC-JP)", $inFile) or die($inFile . ": $!");
open(outFp, ">:encoding(CP932)", $outFile) or die($outFile . ": $!");

while (my $line = <inFp>) {
    print outFp $line;
}

close(inFp);
close(outFp);

Pattern 2

use Encode;

if (@ARGV < 1) {
    die("Usage: " . __FILE__ . " dirname");
}

my $dir = $ARGV[0];
my $inFile = $dir . "/in.txt";
my $outFile = $dir . "/out.txt";

open(inFp, "<", $inFile) or die($inFile . ": $!");
open(outFp, ">", $outFile) or die($outFile . ": $!");

while (my $line = <inFp>) {
    $line = encode('CP932', decode('EUC-JP', $line));
    print outFp $line;
}

close(inFp);
close(outFp);

Go

package main

import (
    "fmt"
    "os"
    "io"
    "bufio"

    "golang.org/x/text/encoding/japanese"
    "golang.org/x/text/transform"
)

func main() {
    if len(os.Args) < 2 {
        fmt.Fprintln(os.Stderr, "Usage: " + os.Args[0] + " dirname")
        os.Exit(1)
    }

    dir := os.Args[1]

    inFile := dir + "/in.txt"
    outFile := dir + "/out.txt"

    inFp, err := os.Open(inFile)
    if err != nil {
        panic(err)
    }
    outFp, err := os.Create(outFile)
    if err != nil {
        inFp.Close()
        panic(err)
    }

    in := bufio.NewReader(transform.NewReader(inFp, japanese.EUCJP.NewDecoder()))
    out := bufio.NewWriter(transform.NewWriter(outFp, japanese.ShiftJIS.NewEncoder()))

    b := make([]byte, 1024)
    for {
        n, err := in.Read(b)
        if err == io.EOF {
            break
        } else if err != nil {
            inFp.Close()
            outFp.Close()
            panic(err)
        }
        out.Write(b[0:n])
    }
    out.Flush()

    inFp.Close()
    outFp.Close()
}

To do this, you first need to run "GOPATH=$(pwd) go get golang.org/x/text/encoding/japanese". Also remember to run the program with "GOPATH=$(pwd) go run Main.go".

Character encoding conversion with Golang - Qiita

bash

#! /bin/bash

dir="${1}"
if [ -z "${dir}" ]; then
    echo "Usage: $0 dirname" >&2
    exit 1
fi
if [ ! -d "${dir}" ]; then
    echo "${dir}: No such directory." >&2
    exit 1
fi

iconv -f EUC-JP -t CP932 < "${dir}/in.txt" > "${dir}/out.txt"