HHeLiBeXの日記 正道編

日々の記憶の記録とメモ‥

CSVファイルの扱いに関する挙動の比較

CSVファイルの読み書きに関する挙動をまとめたメモ。

はじめに

まず、復習がてらRFC 4180を見直す。

BNF表記の部分を抜き出す。

The ABNF grammar [2] appears as follows:
file = [header CRLF] record *(CRLF record) [CRLF]
header = name *(COMMA name)
record = field *(COMMA field)
name = field
field = (escaped / non-escaped)
escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
non-escaped = *TEXTDATA
COMMA = %x2C
CR = %x0D ;as per section 6.1 of RFC 2234 [2]
DQUOTE = %x22 ;as per section 6.1 of RFC 2234 [2]
LF = %x0A ;as per section 6.1 of RFC 2234 [2]
CRLF = CR LF ;as per section 6.1 of RFC 2234 [2]
TEXTDATA = %x20-21 / %x23-2B / %x2D-7E

TEXTDATAの部分がASCII文字のみで構成されているのは時代だから仕方ないとして、ここでは空白文字(0x20)も対象に入っていることに着目したい。つまり、ダブルクォーテーションで括ろうが括るまいが、空白文字は値の一部として扱われるべきというのがRFC 4180の主張である。

検証に使うCSVファイル

基本的なCSV形式のデータを含むファイル。

a,b,c,d,e
A,B,C,D,E
"カンマ(comma)",",","","でぃ",""
"ダブルクォーテーション(double quotation)","""","","でぃ",""
"改行","
","","でぃ",""
"空白文字の扱い(white space) 1", B ,"","でぃ",""
"空白文字の扱い(white space) 2"," B ","","でぃ",""

空白文字の扱い(white space) 1/2は、ダブルクォーテーションで括るか括らないかで両端の空白文字の扱いに変化が出るかどうかを確認するためのデータ。

  • test-backslash.csv

バックスラッシュをエスケープ文字とする「fgetcsv($stream)」のような例があるので、あえて分けて検証する。

a,b,c,d,e
"バックスラッシュ(back slash)","\","","でぃ",""
A,B,C,D,E
  • invalid-test1.csv

不正ケースの1。

a,b,c,d,e
"不正な値(invalid value) 1","びぃ\"B","","でぃ",""
A,B,C,D,E

バックスラッシュでエスケープしているつもりで、本来は単にダブルクォーテーションが余分にあるというデータ。

  • invalid-test2.csv

不正ケースの2。

a,b,c,d,e
"不正な値(invalid value) 2","びぃ"B"びぃ","","でぃ",""
A,B,C,D,E

RFC 4180によると、

field = (escaped / non-escaped)
escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
non-escaped = *TEXTDATA

なので、上記の「"びぃ"B"びぃ"」の部分は「escaped non-escaped escaped」となっていて、本来なら構文エラーになるはずのデータ。

fgetcsv/fputcsv (PHP)

fgetcsvについては以下。

なお、str_getcsvは基本的にfgetcsvと同じなので、ここでは省略する。

<?php

mb_internal_encoding('UTF-8');

$header = array();

while (($row = fgetcsv(STDIN, 0, ',', '"', '"'))) {
    if (empty($header)) {
        $header = $row;
        fputcsv(STDOUT, $header);
    } else {
        $ss = $row;
        $a = array();
        for ($i = 0; $i < count($header); ++$i) {
            if (isset($ss[$i])) {
                $a[$header[$i]] = mb_convert_encoding($ss[$i], mb_internal_encoding(), 'SJIS-win');
            } else {
                $a[$header[$i]] = 'N/A';
            }
        }
        var_dump($a);
        fputcsv(STDOUT, $a);
    }
}

実行結果。

$ php Main.php < test.csv
a,b,c,d,e
array(5) {
  ["a"]=>
  string(1) "A"
  ["b"]=>
  string(1) "B"
  ["c"]=>
  string(1) "C"
  ["d"]=>
  string(1) "D"
  ["e"]=>
  string(1) "E"
}
A,B,C,D,E
array(5) {
  ["a"]=>
  string(16) "カンマ(comma)"
  ["b"]=>
  string(1) ","
  ["c"]=>
  string(0) ""
  ["d"]=>
  string(6) "でぃ"
  ["e"]=>
  string(0) ""
}
カンマ(comma),",",,でぃ,
array(5) {
  ["a"]=>
  string(51) "ダブルクォーテーション(double quotation)"
  ["b"]=>
  string(1) """
  ["c"]=>
  string(0) ""
  ["d"]=>
  string(6) "でぃ"
  ["e"]=>
  string(0) ""
}
"ダブルクォーテーション(double quotation)","""",,でぃ,
array(5) {
  ["a"]=>
  string(6) "改行"
  ["b"]=>
  string(2) "
"
  ["c"]=>
  string(0) ""
  ["d"]=>
  string(6) "でぃ"
  ["e"]=>
  string(0) ""
}
改行,"
",,でぃ,
array(5) {
  ["a"]=>
  string(36) "空白文字の扱い(white space) 1"
  ["b"]=>
  string(3) " B "
  ["c"]=>
  string(0) ""
  ["d"]=>
  string(6) "でぃ"
  ["e"]=>
  string(0) ""
}
"空白文字の扱い(white space) 1"," B ",,でぃ,
array(5) {
  ["a"]=>
  string(36) "空白文字の扱い(white space) 2"
  ["b"]=>
  string(3) " B "
  ["c"]=>
  string(0) ""
  ["d"]=>
  string(6) "でぃ"
  ["e"]=>
  string(0) ""
}
"空白文字の扱い(white space) 2"," B ",,でぃ,
$ 
$ php Main.php < test-backslash.csv
a,b,c,d,e
array(5) {
  ["a"]=>
  string(36) "バックスラッシュ(back slash)"
  ["b"]=>
  string(1) "\"
  ["c"]=>
  string(0) ""
  ["d"]=>
  string(6) "でぃ"
  ["e"]=>
  string(0) ""
}
"バックスラッシュ(back slash)","\",,でぃ,
array(5) {
  ["a"]=>
  string(1) "A"
  ["b"]=>
  string(1) "B"
  ["c"]=>
  string(1) "C"
  ["d"]=>
  string(1) "D"
  ["e"]=>
  string(1) "E"
}
A,B,C,D,E
$ 
$ php Main.php < invalid-test1.csv
a,b,c,d,e
array(5) {
  ["a"]=>
  string(29) "不正な値(invalid value) 1"
  ["b"]=>
  string(9) "びぃ\B""
  ["c"]=>
  string(0) ""
  ["d"]=>
  string(6) "でぃ"
  ["e"]=>
  string(0) ""
}
"不正な値(invalid value) 1","びぃ\B""",,でぃ,
array(5) {
  ["a"]=>
  string(1) "A"
  ["b"]=>
  string(1) "B"
  ["c"]=>
  string(1) "C"
  ["d"]=>
  string(1) "D"
  ["e"]=>
  string(1) "E"
}
A,B,C,D,E
$ 
$ php Main.php < invalid-test2.csv
a,b,c,d,e
array(5) {
  ["a"]=>
  string(29) "不正な値(invalid value) 2"
  ["b"]=>
  string(15) "びぃB"びぃ""
  ["c"]=>
  string(0) ""
  ["d"]=>
  string(6) "でぃ"
  ["e"]=>
  string(0) ""
}
"不正な値(invalid value) 2","びぃB""びぃ""",,でぃ,
array(5) {
  ["a"]=>
  string(1) "A"
  ["b"]=>
  string(1) "B"
  ["c"]=>
  string(1) "C"
  ["d"]=>
  string(1) "D"
  ["e"]=>
  string(1) "E"
}
A,B,C,D,E
$ 

test.csvに関しては、fgetcsvのescapeを指定することにより、特に問題なく処理されている。

test-backslash.csvについては、「fgetcsv(STDIN, ',', '"', '"')」とエスケープ文字を変えているので問題なし。

invalid-test1.csvは謎。なぜ「"びぃ\"B"」を読み込んだ結果が「びぃ\B"」となるのか。

invalid-test2.csvも謎。せめて「びぃBびぃ」となってほしいところが「びぃB"びぃ"」となっている。どういうロジックでこうなっているんだろうか。

Super CSV (Java)

Super CSV自体についての詳細は以下。

Super CSVで一番シンプルにCSVデータの読み書きをするにはorg.supercsv.io.CsvListReaderとorg.supercsv.io.CsvListWriterを使う。

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.util.Arrays;
import java.util.List;

import org.supercsv.exception.SuperCsvException;
import org.supercsv.io.CsvListReader;
import org.supercsv.io.CsvListWriter;
import org.supercsv.io.ICsvListReader;
import org.supercsv.io.ICsvListWriter;
import org.supercsv.prefs.CsvPreference;

public class Main {
    public static void main(String[] args) throws IOException {
        CsvPreference csvPref = new CsvPreference.Builder(
            CsvPreference.STANDARD_PREFERENCE).surroundingSpacesNeedQuotes(true).build();
        ICsvListReader reader = new CsvListReader(
            new BufferedReader(new InputStreamReader(System.in, "Windows-31J")), csvPref);
        ICsvListWriter writer = new CsvListWriter(
            new OutputStreamWriter(System.out), csvPref);
        String[] header = reader.getHeader(true);
        System.out.println(Arrays.toString(header));
        while (true) {
            List<String> row;
            try {
                if ((row = reader.read()) == null) {
                    break;
                }
            } catch (SuperCsvException e) {
                e.printStackTrace();
                continue;
            }

            System.out.println();
            int i = 0;
            for (String val : row) {
                System.out.println((header.length >= i + 1 ? header[i++] : "N/A")
                    + " => " + (val != null ? "\"" + val + "\"" : val));
            }
            writer.write(row);
            writer.flush();
        }
    }
}

実行結果。

[super-csv]$ mvn package
[super-csv]$ 
[super-csv]$ mvn exec:java < ../test.csv
[a, b, c, d, e]

a => "A"
b => "B"
c => "C"
d => "D"
e => "E"
A,B,C,D,E

a => "カンマ(comma)"
b => ","
c => null
d => "でぃ"
e => null
カンマ(comma),",",,でぃ,

a => "ダブルクォーテーション(double quotation)"
b => """
c => null
d => "でぃ"
e => null
ダブルクォーテーション(double quotation),"""",,でぃ,

a => "改行"
b => "
"
c => null
d => "でぃ"
e => null
改行,"
",,でぃ,

a => "空白文字の扱い(white space) 1"
b => "B"
c => null
d => "でぃ"
e => null
空白文字の扱い(white space) 1,B,,でぃ,

a => "空白文字の扱い(white space) 2"
b => " B "
c => null
d => "でぃ"
e => null
空白文字の扱い(white space) 2," B ",,でぃ,
[super-csv]$ 
[super-csv]$ mvn exec:java < ../test-backslash.csv
[a, b, c, d, e]

a => "バックスラッシュ(back slash)"
b => "\"
c => null
d => "でぃ"
e => null
バックスラッシュ(back slash),\,,でぃ,

a => "A"
b => "B"
c => "C"
d => "D"
e => "E"
A,B,C,D,E
[super-csv]$ 
[super-csv]$ mvn exec:java < ../invalid-test1.csv
[a, b, c, d, e]
org.supercsv.exception.SuperCsvException: unexpected end of file while reading quoted column beginning on line 2 and ending on line 3
context=null
    at org.supercsv.io.Tokenizer.readColumns(Tokenizer.java:162)
    at org.supercsv.io.AbstractCsvReader.readRow(AbstractCsvReader.java:179)
    at org.supercsv.io.CsvListReader.read(CsvListReader.java:69)
    at Main.main(Main.java:28)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:282)
    at java.lang.Thread.run(Thread.java:748)
[super-csv]$ 
[super-csv]$ mvn exec:java < ../invalid-test2.csv
[a, b, c, d, e]

a => "不正な値(invalid value) 2"
b => "びぃBびぃ"
c => null
d => "でぃ"
e => null
不正な値(invalid value) 2,びぃBびぃ,,でぃ,

a => "A"
b => "B"
c => "C"
d => "D"
e => "E"
A,B,C,D,E
[super-csv]$ 

test.csvでは、「空白文字の扱い(white space) 1」で前後の空白文字が無視されている。

test-backslash.csvは問題なし。

invalid-test1.csvでは、期待通り構文エラーだとしてorg.supercsv.exception.SuperCsvExceptionが投げられている。ただ、これはjava.lang.RuntimeExceptionを継承していることに注意しないといけない。

invalid-test2.csvは、RFC 4180違反ではあるけど、「びぃBびぃ」として扱っていて、まあまあ。

Super CSV Annotation (Java)

Super CSV Annotation自体についての詳細は以下。

Super CSV Annotationでは、CSVデータに合わせたBeanクラスを作る。それ以外の部分は基本的にSuper CSVをベースとしているので、まぁ結果は予想がつくが。

import com.github.mygreen.supercsv.annotation.CsvBean;
import com.github.mygreen.supercsv.annotation.CsvColumn;

@CsvBean(header=true)
public class HogeBean {
    @CsvColumn(number=1)
    private String fieldA;
    @CsvColumn(number=2, label="びぃ")
    private String fieldB;
    @CsvColumn(number=3)
    private String fieldC;
    @CsvColumn(number=4)
    private String fieldD;
    @CsvColumn(number=5)
    private String fieldE;

    public HogeBean() {
    }

    public String getFieldA() {
        return fieldA;
    }
    public void setFieldA(String fieldA) {
        this.fieldA = fieldA;
    }

    public String getFieldB() {
        return fieldB;
    }
    public void setFieldB(String fieldB) {
        this.fieldB = fieldB;
    }

    public String getFieldC() {
        return fieldC;
    }
    public void setFieldC(String fieldC) {
        this.fieldC = fieldC;
    }

    public String getFieldD() {
        return fieldD;
    }
    public void setFieldD(String fieldD) {
        this.fieldD = fieldD;
    }

    public String getFieldE() {
        return fieldE;
    }
    public void setFieldE(String fieldE) {
        this.fieldE = fieldE;
    }
}
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.util.Arrays;

import com.github.mygreen.supercsv.io.CsvAnnotationBeanReader;
import com.github.mygreen.supercsv.io.CsvAnnotationBeanWriter;

import org.supercsv.exception.SuperCsvException;
import org.supercsv.prefs.CsvPreference;

public class Main {
    public static void main(String[] args) throws IOException {
        CsvPreference csvPref = new CsvPreference.Builder(
            CsvPreference.STANDARD_PREFERENCE).surroundingSpacesNeedQuotes(true).build();
        CsvAnnotationBeanReader<HogeBean> reader = new CsvAnnotationBeanReader<>(
            HogeBean.class,
            new BufferedReader(new InputStreamReader(System.in, "Windows-31J")),
            csvPref);
        CsvAnnotationBeanWriter<HogeBean> writer = new CsvAnnotationBeanWriter<>(
            HogeBean.class,
            new OutputStreamWriter(System.out),
            csvPref);
        String[] header = reader.getHeader(true);
        System.out.println(Arrays.toString(header));
        while (true) {
            HogeBean hoge;
            try {
                if ((hoge = reader.read()) == null) {
                    break;
                }
            } catch (SuperCsvException e) {
                e.printStackTrace();
                continue;
            }

            System.out.println();
            System.out.println((header.length >= 1 ? header[0] : "N/A")
                + " => " + (hoge.getFieldA() != null ? "\"" + hoge.getFieldA() + "\"" : null));
            System.out.println((header.length >= 2 ? header[1] : "N/A")
                + " => " + (hoge.getFieldB() != null ? "\"" + hoge.getFieldB() + "\"" : null));
            System.out.println((header.length >= 3 ? header[2] : "N/A")
                + " => " + (hoge.getFieldC() != null ? "\"" + hoge.getFieldC() + "\"" : null));
            System.out.println((header.length >= 4 ? header[3] : "N/A")
                + " => " + (hoge.getFieldD() != null ? "\"" + hoge.getFieldD() + "\"" : null));
            System.out.println((header.length >= 5 ? header[4] : "N/A")
                + " => " + (hoge.getFieldE() != null ? "\"" + hoge.getFieldE() + "\"" : null));
            writer.write(hoge);
            writer.flush();
        }
    }
}

実行結果。

[super-csv-annotation]$ mvn package
[super-csv-annotation]$ 
[super-csv-annotation]$ mvn exec:java < ../test.csv
[a, b, c, d, e]

a => "A"
b => "B"
c => "C"
d => "D"
e => "E"
A,B,C,D,E

a => "カンマ(comma)"
b => ","
c => null
d => "でぃ"
e => null
カンマ(comma),",",,でぃ,

a => "ダブルクォーテーション(double quotation)"
b => """
c => null
d => "でぃ"
e => null
ダブルクォーテーション(double quotation),"""",,でぃ,

a => "改行"
b => "
"
c => null
d => "でぃ"
e => null
改行,"
",,でぃ,

a => "空白文字の扱い(white space) 1"
b => "B"
c => null
d => "でぃ"
e => null
空白文字の扱い(white space) 1,B,,でぃ,

a => "空白文字の扱い(white space) 2"
b => " B "
c => null
d => "でぃ"
e => null
空白文字の扱い(white space) 2," B ",,でぃ,
[super-csv-annotation]$ 
[super-csv-annotation]$ mvn exec:java < ../test-backslash.csv
[a, b, c, d, e]

a => "バックスラッシュ(back slash)"
b => "\"
c => null
d => "でぃ"
e => null
バックスラッシュ(back slash),\,,でぃ,

a => "A"
b => "B"
c => "C"
d => "D"
e => "E"
A,B,C,D,E
[super-csv-annotation]$ 
[super-csv-annotation]$ mvn exec:java < ../invalid-test1.csv
[a, b, c, d, e]
org.supercsv.exception.SuperCsvException: unexpected end of file while reading quoted column beginning on line 2 and ending on line 3
context=null
    at org.supercsv.io.Tokenizer.readColumns(Tokenizer.java:162)
    at org.supercsv.io.AbstractCsvReader.readRow(AbstractCsvReader.java:179)
    at com.github.mygreen.supercsv.io.AbstractCsvAnnotationBeanReader.read(AbstractCsvAnnotationBeanReader.java:90)
    at Main.main(Main.java:30)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:282)
    at java.lang.Thread.run(Thread.java:748)
[super-csv-annotation]$ 
[super-csv-annotation]$ mvn exec:java < ../invalid-test2.csv
[a, b, c, d, e]

a => "不正な値(invalid value) 2"
b => "びぃBびぃ"
c => null
d => "でぃ"
e => null
不正な値(invalid value) 2,びぃBびぃ,,でぃ,

a => "A"
b => "B"
c => "C"
d => "D"
e => "E"
A,B,C,D,E
[super-csv-annotation]$ 

予想通り、ですよねー、という結果。

OrangeSignal CSV (Java)

OrangeSignal CSV自体についての詳細は以下。

クイックスタートを見ると、Super CSV Annotationと同様にBeanクラスを作るやり方のようだ。クイックスタートの通りにpom.xmlを書くとエラーを吐くので、以下のように書く。

<dependency>
  <groupId>com.orangesignal</groupId>
  <artifactId>orangesignal-csv</artifactId>
  <version>2.2.1</version>
</dependency>
import com.orangesignal.csv.annotation.CsvColumn;
import com.orangesignal.csv.annotation.CsvEntity;

@CsvEntity(header=true)
public class HogeBean {
    @CsvColumn(name="a")
    private String fieldA;
    @CsvColumn(name="b")
    private String fieldB;
    @CsvColumn(name="c")
    private String fieldC;
    @CsvColumn(name="d")
    private String fieldD;
    @CsvColumn(name="e")
    private String fieldE;

    public HogeBean() {
    }

    public String getFieldA() {
        return fieldA;
    }
    public void setFieldA(String fieldA) {
        this.fieldA = fieldA;
    }

    public String getFieldB() {
        return fieldB;
    }
    public void setFieldB(String fieldB) {
        this.fieldB = fieldB;
    }

    public String getFieldC() {
        return fieldC;
    }
    public void setFieldC(String fieldC) {
        this.fieldC = fieldC;
    }

    public String getFieldD() {
        return fieldD;
    }
    public void setFieldD(String fieldD) {
        this.fieldD = fieldD;
    }

    public String getFieldE() {
        return fieldE;
    }
    public void setFieldE(String fieldE) {
        this.fieldE = fieldE;
    }
}
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.util.List;

import com.orangesignal.csv.annotation.CsvColumnException;
import com.orangesignal.csv.CsvConfig;
import com.orangesignal.csv.CsvReader;
import com.orangesignal.csv.CsvWriter;
import com.orangesignal.csv.io.CsvEntityReader;
import com.orangesignal.csv.io.CsvEntityWriter;

public class Main {
    public static void main(String[] args) throws IOException {
        CsvConfig cfg = new CsvConfig(',', '"', '"');
        CsvEntityReader<HogeBean> reader = CsvEntityReader.newInstance(new CsvReader(
            new BufferedReader(new InputStreamReader(System.in, "Windows-31J")), cfg), HogeBean.class);
        CsvEntityWriter<HogeBean> writer = CsvEntityWriter.newInstance(new CsvWriter(
            new OutputStreamWriter(System.out), cfg), HogeBean.class);
        List<String> header = reader.getHeader();
        System.out.println(header);
        while (true) {
            HogeBean hoge;
            try {
                if ((hoge = reader.read()) == null) {
                    break;
                }
            } catch (CsvColumnException e) {
                e.printStackTrace();
                continue;
            }

            System.out.println();
            System.out.println((header.size() >= 1 ? header.get(0) : "N/A")
                + " => " + (hoge.getFieldA() != null ? "\"" + hoge.getFieldA() + "\"" : null));
            System.out.println((header.size() >= 2 ? header.get(1) : "N/A")
                + " => " + (hoge.getFieldB() != null ? "\"" + hoge.getFieldB() + "\"" : null));
            System.out.println((header.size() >= 3 ? header.get(2) : "N/A")
                + " => " + (hoge.getFieldC() != null ? "\"" + hoge.getFieldC() + "\"" : null));
            System.out.println((header.size() >= 4 ? header.get(3) : "N/A")
                + " => " + (hoge.getFieldD() != null ? "\"" + hoge.getFieldD() + "\"" : null));
            System.out.println((header.size() >= 5 ? header.get(4) : "N/A")
                + " => " + (hoge.getFieldE() != null ? "\"" + hoge.getFieldE() + "\"" : null));
            writer.write(hoge);
            writer.flush();
        }
    }
}

実行結果。

[orange-signal-csv]$ mvn package
[orange-signal-csv]$ 
[orange-signal-csv]$ mvn exec:java < ../test.csv
[a, b, c, d, e]

a => "A"
b => "B"
c => "C"
d => "D"
e => "E"
"a","b","c","d","e"
"A","B","C","D","E"

a => "カンマ(comma)"
b => ","
c => ""
d => "でぃ"
e => ""
"カンマ(comma)",",","","でぃ",""

a => "ダブルクォーテーション(double quotation)"
b => """
c => ""
d => "でぃ"
e => ""
"ダブルクォーテーション(double quotation)","""","","でぃ",""

a => "改行"
b => "
"
c => ""
d => "でぃ"
e => ""
"改行","
","","でぃ",""

a => "空白文字の扱い(white space) 1"
b => " B "
c => ""
d => "でぃ"
e => ""
"空白文字の扱い(white space) 1"," B ","","でぃ",""

a => "空白文字の扱い(white space) 2"
b => " B "
c => ""
d => "でぃ"
e => ""
"空白文字の扱い(white space) 2"," B ","","でぃ",""
[WARNING] 
java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
    at java.util.ArrayList.rangeCheck(ArrayList.java:657)
    at java.util.ArrayList.get(ArrayList.java:433)
    at com.orangesignal.csv.io.CsvEntityReader.convert(CsvEntityReader.java:300)
    at com.orangesignal.csv.io.CsvEntityReader.read(CsvEntityReader.java:198)
    at Main.main(Main.java:26)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:282)
    at java.lang.Thread.run(Thread.java:748)
[orange-signal-csv]$ 
[orange-signal-csv]$ mvn exec:java < ../test-backslash.csv
[a, b, c, d, e]

a => "バックスラッシュ(back slash)"
b => "\"
c => ""
d => "でぃ"
e => ""
"a","b","c","d","e"
"バックスラッシュ(back slash)","\","","でぃ",""

a => "A"
b => "B"
c => "C"
d => "D"
e => "E"
"A","B","C","D","E"
[WARNING] 
java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
    at java.util.ArrayList.rangeCheck(ArrayList.java:657)
    at java.util.ArrayList.get(ArrayList.java:433)
    at com.orangesignal.csv.io.CsvEntityReader.convert(CsvEntityReader.java:300)
    at com.orangesignal.csv.io.CsvEntityReader.read(CsvEntityReader.java:198)
    at Main.main(Main.java:26)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:282)
    at java.lang.Thread.run(Thread.java:748)
[orange-signal-csv]$ 
[orange-signal-csv]$ mvn exec:java < ../invalid-test1.csv
[a, b, c, d, e]

a => "不正な値(invalid value) 1"
b => "びぃ\"B"
c => ""
d => "でぃ"
e => ""
"a","b","c","d","e"
"不正な値(invalid value) 1","びぃ\""B","","でぃ",""

a => "A"
b => "B"
c => "C"
d => "D"
e => "E"
"A","B","C","D","E"
[WARNING] 
java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
    at java.util.ArrayList.rangeCheck(ArrayList.java:657)
    at java.util.ArrayList.get(ArrayList.java:433)
    at com.orangesignal.csv.io.CsvEntityReader.convert(CsvEntityReader.java:300)
    at com.orangesignal.csv.io.CsvEntityReader.read(CsvEntityReader.java:198)
    at Main.main(Main.java:26)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:282)
    at java.lang.Thread.run(Thread.java:748)
[orange-signal-csv]$ 
[orange-signal-csv]$ mvn exec:java < ../invalid-test2.csv
[a, b, c, d, e]

a => "不正な値(invalid value) 2"
b => "びぃ"B"びぃ"
c => ""
d => "でぃ"
e => ""
"a","b","c","d","e"
"不正な値(invalid value) 2","びぃ""B""びぃ","","でぃ",""

a => "A"
b => "B"
c => "C"
d => "D"
e => "E"
"A","B","C","D","E"
[WARNING] 
java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
    at java.util.ArrayList.rangeCheck(ArrayList.java:657)
    at java.util.ArrayList.get(ArrayList.java:433)
    at com.orangesignal.csv.io.CsvEntityReader.convert(CsvEntityReader.java:300)
    at com.orangesignal.csv.io.CsvEntityReader.read(CsvEntityReader.java:198)
    at Main.main(Main.java:26)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:282)
    at java.lang.Thread.run(Thread.java:748)
[orange-signal-csv]$ 

いろいろと微妙だ。

まず、以下を生成して渡してやらないと、囲み文字(")すら有効にならないらしいのでそれは指定したが。

        CsvConfig cfg = new CsvConfig(',', '"', '"');

いずれのファイルに対しても、ファイル末尾の改行コードが邪魔をして、その後ろを空文字列から成る行として扱い、IndexOutOfBoundsExceptionを吐く。これ、OrangeSignal CSVで出力したCSVファイルを食わせたら例外を吐くのではないか?

test.csvとtest-backslash.csvについては特に問題はなさそう(CsvConfigでいろいろと挙動を変えられるらしいが)。

invalid-test1.csvは、なぜ「びぃ\"B」と解析できるのか謎。

invalid-test2.csvは、「びぃ"B"びぃ」となっていて、これも謎。

invalid-test1.csvもinvalid-test2.csvも、ダブルクォーテーションの次が区切り文字(カンマ)でなければ無視するという実装なんだろうか?

opencsv (Java)

opencsv自体についての詳細は以下。

こちらもBeanを作成する方法でやってみる。

import com.opencsv.bean.CsvBindByPosition;

public class HogeBean {
    @CsvBindByPosition(position=0)
    private String fieldA;
    @CsvBindByPosition(position=1)
    private String fieldB;
    @CsvBindByPosition(position=2)
    private String fieldC;
    @CsvBindByPosition(position=3)
    private String fieldD;
    @CsvBindByPosition(position=4)
    private String fieldE;

    public HogeBean() {
    }

    public String getFieldA() {
        return fieldA;
    }
    public void setFieldA(String fieldA) {
        this.fieldA = fieldA;
    }

    public String getFieldB() {
        return fieldB;
    }
    public void setFieldB(String fieldB) {
        this.fieldB = fieldB;
    }

    public String getFieldC() {
        return fieldC;
    }
    public void setFieldC(String fieldC) {
        this.fieldC = fieldC;
    }

    public String getFieldD() {
        return fieldD;
    }
    public void setFieldD(String fieldD) {
        this.fieldD = fieldD;
    }

    public String getFieldE() {
        return fieldE;
    }
    public void setFieldE(String fieldE) {
        this.fieldE = fieldE;
    }
}
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;

import com.opencsv.bean.CsvToBean;
import com.opencsv.bean.CsvToBeanBuilder;
import com.opencsv.bean.StatefulBeanToCsv;
import com.opencsv.bean.StatefulBeanToCsvBuilder;
import com.opencsv.exceptions.CsvDataTypeMismatchException;
import com.opencsv.exceptions.CsvRequiredFieldEmptyException;

public class Main {
    public static void main(String[] args) throws IOException {
        CsvToBean<HogeBean> reader = new CsvToBeanBuilder<HogeBean>(
                new BufferedReader(new InputStreamReader(System.in, "Windows-31J"))
            ).withType(HogeBean.class).build();
        Writer w = new OutputStreamWriter(System.out); // flush()するために変数にセット。
        StatefulBeanToCsv<HogeBean> writer = new StatefulBeanToCsvBuilder<HogeBean>(
                w
            ).build();
        for (HogeBean hoge : reader) {
            System.out.println();
            System.out.println("[0] => "
                + (hoge.getFieldA() != null ? "\"" + hoge.getFieldA() + "\"" : null));
            System.out.println("[1] => "
                + (hoge.getFieldB() != null ? "\"" + hoge.getFieldB() + "\"" : null));
            System.out.println("[2] => "
                + (hoge.getFieldC() != null ? "\"" + hoge.getFieldC() + "\"" : null));
            System.out.println("[3] => "
                + (hoge.getFieldD() != null ? "\"" + hoge.getFieldD() + "\"" : null));
            System.out.println("[4] => "
                + (hoge.getFieldE() != null ? "\"" + hoge.getFieldE() + "\"" : null));
            try {
                writer.write(hoge);
                w.flush();
            } catch (CsvDataTypeMismatchException|CsvRequiredFieldEmptyException e) {
                e.printStackTrace();
            }
        }
    }
}

実行結果。

[opencsv]$ mvn package
[opencsv]$ 
[opencsv]$ mvn exec:java < ../test.csv

[0] => "a"
[1] => "b"
[2] => "c"
[3] => "d"
[4] => "e"
"a","b","c","d","e"

[0] => "A"
[1] => "B"
[2] => "C"
[3] => "D"
[4] => "E"
"A","B","C","D","E"

[0] => "カンマ(comma)"
[1] => ","
[2] => ""
[3] => "でぃ"
[4] => ""
"カンマ(comma)",",","","でぃ",""

[0] => "ダブルクォーテーション(double quotation)"
[1] => """
[2] => ""
[3] => "でぃ"
[4] => ""
"ダブルクォーテーション(double quotation)","""","","でぃ",""

[0] => "改行"
[1] => "
"
[2] => ""
[3] => "でぃ"
[4] => ""
"改行","
","","でぃ",""

[0] => "空白文字の扱い(white space) 1"
[1] => " B "
[2] => ""
[3] => "でぃ"
[4] => ""
"空白文字の扱い(white space) 1"," B ","","でぃ",""

[0] => "空白文字の扱い(white space) 2"
[1] => " B "
[2] => ""
[3] => "でぃ"
[4] => ""
"空白文字の扱い(white space) 2"," B ","","でぃ",""
[opencsv]$ 
[opencsv]$ mvn exec:java < ../test-backslash.csv
[WARNING] 
java.lang.RuntimeException: Error parsing CSV.
    at com.opencsv.bean.CsvToBean$CsvToBeanIterator.readSingleLine(CsvToBean.java:394)
    at com.opencsv.bean.CsvToBean$CsvToBeanIterator.next(CsvToBean.java:410)
    at Main.main(Main.java:23)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:282)
    at java.lang.Thread.run(Thread.java:748)
Caused by: com.opencsv.exceptions.CsvMalformedLineException: Unterminated quoted field at end of CSV line. Beginning of lost text: [",",でぃ,"
A,B,C,D,E
]
    at com.opencsv.CSVReader.primeNextRecord(CSVReader.java:245)
    at com.opencsv.CSVReader.flexibleRead(CSVReader.java:598)
    at com.opencsv.CSVReader.readNext(CSVReader.java:204)
    at com.opencsv.bean.concurrent.SingleLineReader.readNextLine(SingleLineReader.java:49)
    at com.opencsv.bean.CsvToBean$CsvToBeanIterator.readLineWithPossibleError(CsvToBean.java:364)
    at com.opencsv.bean.CsvToBean$CsvToBeanIterator.readSingleLine(CsvToBean.java:391)
    ... 8 more
[opencsv]$ 
[opencsv]$ mvn exec:java < ../invalid-test1.csv

[0] => "a"
[1] => "b"
[2] => "c"
[3] => "d"
[4] => "e"
"a","b","c","d","e"

[0] => "不正な値(invalid value) 1"
[1] => "びぃ"B"
[2] => ""
[3] => "でぃ"
[4] => ""
"不正な値(invalid value) 1","びぃ""B","","でぃ",""

[0] => "A"
[1] => "B"
[2] => "C"
[3] => "D"
[4] => "E"
"A","B","C","D","E"
[opencsv]$ 
[opencsv]$ mvn exec:java < ../invalid-test2.csv

[0] => "a"
[1] => "b"
[2] => "c"
[3] => "d"
[4] => "e"
"a","b","c","d","e"

[0] => "不正な値(invalid value) 2"
[1] => "びぃ"B"びぃ"
[2] => ""
[3] => "でぃ"
[4] => ""
"不正な値(invalid value) 2","びぃ""B""びぃ","","でぃ",""

[0] => "A"
[1] => "B"
[2] => "C"
[3] => "D"
[4] => "E"
"A","B","C","D","E"
[opencsv]$ 

test.csvは特に問題なし。

test-backslash.csvは、バックスラッシュがエスケープ文字だと扱われて構文エラーになっている。 最初、

        CsvToBean<HogeBean> reader = new CsvToBeanBuilder<HogeBean>(
                new BufferedReader(new InputStreamReader(System.in, "Windows-31J"))
            )
            .withSeparator(',')
            .withQuoteChar('"')
            .withEscapeChar('"')
            .withType(HogeBean.class).build();

と書いて試してみたけど、 java.lang.UnsupportedOperationException: The separator, quote, and escape characters must be different! と怒られてしまう。

invalid-test1.csvは、なぜか「びぃ"B」となる。

invalid-test2.csvも、なぜか「びぃ"B"びぃ」となる。

まとめ

RFC 4180をもとにした各方法の挙動は以下のようになる。

  1. ダブルクォーテーションによるクォート文字列の扱い
  2. クォート文字列中のダブルクォーテーションの扱い
  3. クォート文字列中の改行コードの扱い
  4. バックスラッシュの扱い(エスケープ文字として扱わないかどうか)
  5. 前後の空白文字の扱い
  6. ダブルクォーテーションが閉じていないケースの扱い
  7. クォート文字列とそれ以外が混在したケースの扱い
関数/ライブラリ 1 2 3 4 5 6 7
fgetcsv/fputcsv (PHP) × ×
Super CSV (Java)
Super CSV Annotation (Java)
OrangeSignal CSV (Java) ○* ○* ○* ○* ○* ×* ×*
opencsv (Java) × × ×

(*)末尾の改行コードが邪魔をして例外を投げる。

参考

おまけ(Excelで開く)

(2021年11月10日追記)

f:id:hhelibex:20211110164101p:plain

  • test-backslash.csv

f:id:hhelibex:20211110164149p:plain

  • invalid-test1.csv

f:id:hhelibex:20211110164221p:plain

  • invalid-test2.csv

f:id:hhelibex:20211110164240p:plain

Excelにとって辛いところは、「一般ユーザーが間違いを修正するために」何が何でも開けなきゃいけないところなんだろうな、と。