2017-07-25

PHPでのファイルアップロードの処理

PHP

ファイルアップロードのファイルサイズチェック - Qiitaを読んでて非常にもやっとしたので、PHPプログラム側のエラー処理を自分なりに整理してみたメモ。

ちなみに検証環境は訳あって以下だが、CentOS 7でも同様だと思う。

CentOS 6
- PHP 5.3.3

php.iniの設定は以下のようになっている。

;;;;;;;;;;;;;;;;;
; Data Handling ;
;;;;;;;;;;;;;;;;;

; Maximum size of POST data that PHP will accept.
; http://www.php.net/manual/en/ini.core.php#ini.post-max-size
post_max_size = 8M

;;;;;;;;;;;;;;;;
; File Uploads ;
;;;;;;;;;;;;;;;;

; Maximum allowed size for uploaded files.
; http://www.php.net/manual/en/ini.core.php#ini.upload-max-filesize
upload_max_filesize = 2M

送信側のHTMLおよび受信側のPHPプログラムをベタで書き起こすと以下のような感じになるだろう。

ファイル送信側のHTML

<!DOCTYPE html>
<html>
    <head>
        <meta charset="UTF-8" />
        <title>$_FILES test</title>
    </head>
    <body>
        <h1>[php.iniのupload_max_filesize] &gt; [hiddenのMAX_FILE_SIZE]</h1>
        <form action="upload.php" method="post" enctype="multipart/form-data">
            <input type="hidden" name="MAX_FILE_SIZE" value="1048576" />
            <input type="file" name="file1" />
            <input type="submit" value="アップロード1" />
        </form>
        <h1>[php.iniのupload_max_filesize] &lt; [hiddenのMAX_FILE_SIZE]</h1>
        <form action="upload.php" method="post" enctype="multipart/form-data">
            <input type="hidden" name="MAX_FILE_SIZE" value="3145728" />
            <input type="file" name="file1" />
            <input type="submit" value="アップロード2" />
        </form>
    </body>
</html>

受信側のPHPプログラム

<a href="./">Back to Home</a>
<hr />
<pre>
<?php
var_dump($_POST);
var_dump($_FILES);
?>
</pre>
<hr />
<?php
$param = 'file1';
if (isset($_FILES[$param])) {
    $err = $_FILES[$param]['error'];
    if ($err === UPLOAD_ERR_OK) {
        if (is_uploaded_file($_FILES[$param]['tmp_name'])) {
            print('OK。何の問題もありません。');
        } else {
            print('えっ！？何が起きた？');
        }
    } else if ($err === UPLOAD_ERR_INI_SIZE) {
        print('ファイルサイズが大きすぎます(php.iniのupload_max_filesize)。' . ini_get('upload_max_filesize') . 'バイト以下にしてください。');
    } else if ($err === UPLOAD_ERR_FORM_SIZE) {
        print('ファイルサイズが大きすぎます(formのMAX_FILE_SIZE)。' . $_POST['MAX_FILE_SIZE'] . 'バイト以下にしてください。');
    } else if ($err === UPLOAD_ERR_PARTIAL) {
        // XXX アップロード中にキャンセルした場合とか？
        print('「アップロードされたファイルは一部のみしかアップロードされていません。」');
    } else if ($err === UPLOAD_ERR_NO_FILE) {
        print('「ファイルはアップロードされませんでした。」');
    } else if ($err === UPLOAD_ERR_NO_TMP_DIR) {
        // XXX upload_tmp_dirで設定したディレクトリが存在しない場合かと思ったが、
        // 「ここで指定したディレクトリに書き込むことができない場合、 PHP はかわりにシステムのデフォルトテンポラリディレクトリを使用します。」らしい。
        print('「テンポラリフォルダがありません。」');
    } else if ($err === UPLOAD_ERR_CANT_WRITE) {
        // XXX ディスク容量不足の場合とか？
        print('「ディスクへの書き込みに失敗しました。」');
    } else if ($err === UPLOAD_ERR_EXTENSION) {
        print('「PHP の拡張モジュールがファイルのアップロードを中止しました。」');
    }
} else {
    // post_max_sizeを超えている
    print('ファイルサイズがあまりに大きすぎます。' . ini_get('upload_max_filesize') . 'バイト以下にしてください。');
}

とまぁ、結構めんどくさいことになるので、必要なエラーコードだけ拾ってきちんと処理し、残りは適当にあしらうのが良いのかなと。「UPLOAD_ERR_NO_TMP_DIR」とかなんか完全にサーバー側の設定の問題だしな。

参考

2017-06-09

ARRAY_TO_STRINGの結果に対してLIKE検索している部分が遅い問題の一つの解決案

PostgreSQL

割と複雑なクエリで4秒とか掛かるものにぶち当たり、explainを取ってみたところ、表題の通りARRAY_TO_STRINGした結果得られる文字列に対するLIKE検索をしている部分でコストが増大していることが分かり、軽く検証してみたメモ。

細かいところは追い追い書いていくとして、まずは準備。

環境

CentOS 7
- PostgreSQL 9.2

準備

以下のようなクエリを流して、テーブルおよびデータを作る。結構時間が掛かるので、実行する際には心してお待ちください。

#! /bin/bash

psql -U postgres -q -c 'DROP DATABASE IF EXISTS test_db'
psql -U postgres -q -c 'CREATE DATABASE test_db'
psql -U postgres -q -c 'CREATE TABLE IF NOT EXISTS user_master (
                              id BIGSERIAL PRIMARY KEY
                            , name VARCHAR(128) NOT NULL
                        )' test_db
psql -U postgres -q -c 'CREATE TABLE IF NOT EXISTS label_list (
                              id BIGSERIAL PRIMARY KEY
                            , user_id BIGINT NOT NULL
                            , label VARCHAR(128) NOT NULL
                        )' test_db
psql -U postgres -q -c 'CREATE INDEX ON label_list(user_id)' test_db
psql -U postgres -q -c 'CREATE INDEX ON label_list(label)' test_db

# 全ユーザ
for ((i = 1; i <= 10000; ++i)); do
        psql -U postgres -q -c "INSERT INTO user_master(name) VALUES('name$(printf %06d ${i})')" test_db
        psql -U postgres -q -c "INSERT INTO label_list(user_id, label) VALUES(${i}, 'A')" test_db
done
# 奇数IDのユーザ
for ((i = 1; i <= 10000; i += 2)); do
        psql -U postgres -q -c "INSERT INTO label_list(user_id, label) VALUES(${i}, 'B')" test_db
done
# 3の倍数＋１のIDのユーザ
for ((i = 1; i <= 10000; i += 3)); do
        psql -U postgres -q -c "INSERT INTO label_list(user_id, label) VALUES(${i}, 'C')" test_db
done
# 5の倍数＋１のIDのユーザ
for ((i = 1; i <= 10000; i += 5)); do
        psql -U postgres -q -c "INSERT INTO label_list(user_id, label) VALUES(${i}, 'D')" test_db
done
# 素数IDのユーザ
for ((i = 1; i <= 10000; ++i)); do
        if [ $(factor ${i} | wc -w) -eq 2 ]; then
                psql -U postgres -q -c "INSERT INTO label_list(user_id, label) VALUES(${i}, 'E')" test_db
        fi
done

説明は・・要るかな・・一応軽く書くと、user_masterテーブルに各ユーザの情報が格納されており、各ユーザに対して付けられているいくつかのラベルがlabel_listテーブルに1ラベル1行として格納されている、という構造。

クエリ1（書き換え前）

オリジナルのクエリからARRAY_TO_STRINGを使って検索している部分を抜き出したもの。

クエリ1-1

SELECT DISTINCT u.id
    FROM user_master u
    WHERE u.id IN (
        SELECT user_id
            FROM (
                SELECT DISTINCT
                      user_id
                    , ARRAY_TO_STRING(
                          ARRAY(
                            SELECT label
                                FROM label_list l2
                                WHERE l1.user_id = l2.user_id
                          )
                        , ',') label_str
                    FROM label_list l1
            ) l0
            WHERE label_str LIKE '%A%'
              AND label_str LIKE '%B%'
              AND label_str LIKE '%E%'
)

クエリ1-2

SELECT DISTINCT u.id
    FROM user_master u
    WHERE u.id IN (
        SELECT user_id
            FROM (
                SELECT DISTINCT
                      user_id
                    , ARRAY_TO_STRING(
                          ARRAY(
                            SELECT label
                                FROM label_list l2
                                WHERE l1.user_id = l2.user_id
                          )
                        , ',') label_str
                    FROM label_list l1
            ) l0
            WHERE label_str LIKE '%A%'
              AND label_str LIKE '%B%'
              AND label_str NOT LIKE '%C%'
              AND label_str NOT LIKE '%D%'
              AND label_str LIKE '%E%'
)

いずれも、ARRAY_TO_STRINGであるユーザにつけられているラベルをカンマ区切りの1つの文字列にしてLIKE検索をしている。

2つのクエリの違いはどのラベルを含むか含まないかの条件の数だけ。

クエリ2（書き換え後）

ARRAY_TO_STRINGを使わないようにするには、検索したいキーワードの数だけlabel_listをキーワードで絞り込んでJOINしていくしかない。

クエリ2-1

SELECT DISTINCT u.id, u.name
    FROM user_master u
    WHERE u.id IN (
        SELECT l0.user_id
            FROM label_list l0
            JOIN (
                SELECT user_id
                    FROM label_list
                    WHERE user_id IN (
                        SELECT user_id FROM label_list WHERE label = 'A'
                    )
            ) l1
            ON l0.user_id = l1.user_id
            JOIN (
                SELECT user_id
                    FROM label_list
                    WHERE user_id IN (
                        SELECT user_id FROM label_list WHERE label = 'B'
                    )
            ) l2
            ON l0.user_id = l2.user_id
            JOIN (
                SELECT user_id
                    FROM label_list
                    WHERE user_id IN (
                        SELECT user_id FROM label_list WHERE label = 'E'
                    )
            ) l3
            ON l0.user_id = l3.user_id
)

クエリ2-2

SELECT DISTINCT u.id, u.name
    FROM user_master u
    WHERE u.id IN (
        SELECT l0.user_id
            FROM label_list l0
            JOIN (
                SELECT user_id
                    FROM label_list
                    WHERE user_id IN (
                        SELECT user_id FROM label_list WHERE label = 'A'
                    )
            ) l1
            ON l0.user_id = l1.user_id
            JOIN (
                SELECT user_id
                    FROM label_list
                    WHERE user_id IN (
                        SELECT user_id FROM label_list WHERE label = 'B'
                    )
            ) l2
            ON l0.user_id = l2.user_id
            JOIN (
                SELECT user_id
                    FROM label_list
                    WHERE user_id NOT IN (
                        SELECT user_id FROM label_list WHERE label = 'C'
                    )
            ) l3
            ON l0.user_id = l3.user_id
            JOIN (
                SELECT user_id
                    FROM label_list
                    WHERE user_id NOT IN (
                        SELECT user_id FROM label_list WHERE label = 'D'
                    )
            ) l4
            ON l0.user_id = l4.user_id
            JOIN (
                SELECT user_id
                    FROM label_list
                    WHERE user_id IN (
                        SELECT user_id FROM label_list WHERE label = 'E'
                    )
            ) l5
            ON l0.user_id = l5.user_id
)

実行結果

全部を載せるのは無理があるので、件数だけ載せておく。

$ cat query1.sql query2.sql | psql -U postgres test_db | grep '^[(]'
(1228 行)
(463 行)
(1228 行)
(463 行)

実際には、両方のクエリで結果が同じになるということは検証してある。

explainの結果

最後にexplainを取った結果を載せておく。

クエリ1-1

                                                                                                          QUERY PLAN                                                                                                          
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=1118038.76..1118416.52 rows=5000 width=8)
   ->  Merge Join  (cost=1118038.76..1118404.02 rows=5000 width=8)
         Merge Cond: (u.id = l1.user_id)
         ->  Index Only Scan using user_master_pkey on user_master u  (cost=0.00..337.25 rows=10000 width=8)
         ->  Sort  (cost=1118038.76..1118039.26 rows=200 width=8)
               Sort Key: l1.user_id
               ->  HashAggregate  (cost=1118029.12..1118031.12 rows=200 width=8)
                     ->  HashAggregate  (cost=997726.56..1117899.45 rows=10374 width=8)
                           ->  Seq Scan on label_list l1  (cost=0.00..997618.90 rows=21533 width=8)
                                 Filter: ((array_to_string((SubPlan 2), ','::text) ~~ '%A%'::text) AND (array_to_string((SubPlan 3), ','::text) ~~ '%B%'::text) AND (array_to_string((SubPlan 4), ','::text) ~~ '%E%'::text))
                                 SubPlan 1
                                   ->  Bitmap Heap Scan on label_list l2  (cost=4.27..11.57 rows=2 width=2)
                                         Recheck Cond: (l1.user_id = user_id)
                                         ->  Bitmap Index Scan on label_list_user_id_idx  (cost=0.00..4.27 rows=2 width=0)
                                               Index Cond: (l1.user_id = user_id)
                                 SubPlan 2
                                   ->  Bitmap Heap Scan on label_list l2  (cost=4.27..11.57 rows=2 width=2)
                                         Recheck Cond: (l1.user_id = user_id)
                                         ->  Bitmap Index Scan on label_list_user_id_idx  (cost=0.00..4.27 rows=2 width=0)
                                               Index Cond: (l1.user_id = user_id)
                                 SubPlan 3
                                   ->  Bitmap Heap Scan on label_list l2  (cost=4.27..11.57 rows=2 width=2)
                                         Recheck Cond: (l1.user_id = user_id)
                                         ->  Bitmap Index Scan on label_list_user_id_idx  (cost=0.00..4.27 rows=2 width=0)
                                               Index Cond: (l1.user_id = user_id)
                                 SubPlan 4
                                   ->  Bitmap Heap Scan on label_list l2  (cost=4.27..11.57 rows=2 width=2)
                                         Recheck Cond: (l1.user_id = user_id)
                                         ->  Bitmap Index Scan on label_list_user_id_idx  (cost=0.00..4.27 rows=2 width=0)
                                               Index Cond: (l1.user_id = user_id)
(30 行)

クエリ1-2

                                                                                                                                                                        QUERY PLAN                                                                                                                                                                        
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=1247100.97..1247599.82 rows=1 width=8)
   ->  Nested Loop Semi Join  (cost=1247100.97..1247599.82 rows=1 width=8)
         Join Filter: (u.id = l0.user_id)
         ->  Index Only Scan using user_master_pkey on user_master u  (cost=0.00..337.25 rows=10000 width=8)
         ->  Materialize  (cost=1247100.97..1247112.57 rows=1 width=8)
               ->  Subquery Scan on l0  (cost=1247100.97..1247112.56 rows=1 width=8)
                     ->  HashAggregate  (cost=1247100.97..1247112.55 rows=1 width=8)
                           ->  Seq Scan on label_list l1  (cost=0.00..1247100.97 rows=1 width=8)
                                 Filter: ((array_to_string((SubPlan 2), ','::text) ~~ '%A%'::text) AND (array_to_string((SubPlan 3), ','::text) ~~ '%B%'::text) AND (array_to_string((SubPlan 4), ','::text) !~~ '%C%'::text) AND (array_to_string((SubPlan 5), ','::text) !~~ '%D%'::text) AND (array_to_string((SubPlan 6), ','::text) ~~ '%E%'::text))
                                 SubPlan 1
                                   ->  Bitmap Heap Scan on label_list l2  (cost=4.27..11.57 rows=2 width=2)
                                         Recheck Cond: (l1.user_id = user_id)
                                         ->  Bitmap Index Scan on label_list_user_id_idx  (cost=0.00..4.27 rows=2 width=0)
                                               Index Cond: (l1.user_id = user_id)
                                 SubPlan 2
                                   ->  Bitmap Heap Scan on label_list l2  (cost=4.27..11.57 rows=2 width=2)
                                         Recheck Cond: (l1.user_id = user_id)
                                         ->  Bitmap Index Scan on label_list_user_id_idx  (cost=0.00..4.27 rows=2 width=0)
                                               Index Cond: (l1.user_id = user_id)
                                 SubPlan 3
                                   ->  Bitmap Heap Scan on label_list l2  (cost=4.27..11.57 rows=2 width=2)
                                         Recheck Cond: (l1.user_id = user_id)
                                         ->  Bitmap Index Scan on label_list_user_id_idx  (cost=0.00..4.27 rows=2 width=0)
                                               Index Cond: (l1.user_id = user_id)
                                 SubPlan 4
                                   ->  Bitmap Heap Scan on label_list l2  (cost=4.27..11.57 rows=2 width=2)
                                         Recheck Cond: (l1.user_id = user_id)
                                         ->  Bitmap Index Scan on label_list_user_id_idx  (cost=0.00..4.27 rows=2 width=0)
                                               Index Cond: (l1.user_id = user_id)
                                 SubPlan 5
                                   ->  Bitmap Heap Scan on label_list l2  (cost=4.27..11.57 rows=2 width=2)
                                         Recheck Cond: (l1.user_id = user_id)
                                         ->  Bitmap Index Scan on label_list_user_id_idx  (cost=0.00..4.27 rows=2 width=0)
                                               Index Cond: (l1.user_id = user_id)
                                 SubPlan 6
                                   ->  Bitmap Heap Scan on label_list l2  (cost=4.27..11.57 rows=2 width=2)
                                         Recheck Cond: (l1.user_id = user_id)
                                         ->  Bitmap Index Scan on label_list_user_id_idx  (cost=0.00..4.27 rows=2 width=0)
                                               Index Cond: (l1.user_id = user_id)
(39 行)

クエリ2-1

                                                                            QUERY PLAN                                                                            
------------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=2545.00..2600.26 rows=5526 width=8)
   ->  Hash Semi Join  (cost=2194.33..2531.18 rows=5526 width=8)
         Hash Cond: (u.id = public.label_list.user_id)
         ->  Seq Scan on user_master u  (cost=0.00..164.00 rows=10000 width=8)
         ->  Hash  (cost=2125.25..2125.25 rows=5526 width=56)
               ->  Hash Join  (cost=1635.83..2125.25 rows=5526 width=56)
                     Hash Cond: (l0.user_id = public.label_list.user_id)
                     ->  Seq Scan on label_list l0  (cost=0.00..353.39 rows=21539 width=8)
                     ->  Hash  (cost=1603.07..1603.07 rows=2621 width=48)
                           ->  Nested Loop  (cost=301.13..1603.07 rows=2621 width=48)
                                 ->  Nested Loop  (cost=301.13..1162.32 rows=1243 width=40)
                                       ->  Nested Loop Semi Join  (cost=301.13..953.11 rows=590 width=32)
                                             ->  Hash Join  (cost=301.13..748.45 rows=590 width=24)
                                                   Hash Cond: (public.label_list.user_id = public.label_list.user_id)
                                                   ->  Hash Join  (cost=27.87..466.52 rows=1165 width=16)
                                                         Hash Cond: (public.label_list.user_id = public.label_list.user_id)
                                                         ->  Seq Scan on label_list  (cost=0.00..353.39 rows=21539 width=8)
                                                         ->  Hash  (cost=25.20..25.20 rows=213 width=8)
                                                               ->  HashAggregate  (cost=23.07..25.20 rows=213 width=8)
                                                                     ->  Index Scan using label_list_label_idx on label_list  (cost=0.00..21.97 rows=441 width=8)
                                                                           Index Cond: ((label)::text = 'E'::text)
                                                   ->  Hash  (cost=242.01..242.01 rows=2500 width=8)
                                                         ->  HashAggregate  (cost=217.01..242.01 rows=2500 width=8)
                                                               ->  Index Scan using label_list_label_idx on label_list  (cost=0.00..204.04 rows=5188 width=8)
                                                                     Index Cond: ((label)::text = 'B'::text)
                                             ->  Index Scan using label_list_user_id_idx on label_list  (cost=0.00..0.34 rows=1 width=8)
                                                   Index Cond: (user_id = public.label_list.user_id)
                                                   Filter: ((label)::text = 'A'::text)
                                       ->  Index Only Scan using label_list_user_id_idx on label_list  (cost=0.00..0.33 rows=2 width=8)
                                             Index Cond: (user_id = public.label_list.user_id)
                                 ->  Index Only Scan using label_list_user_id_idx on label_list  (cost=0.00..0.33 rows=2 width=8)
                                       Index Cond: (user_id = public.label_list.user_id)
(32 行)

クエリ2-2

                                                                                      QUERY PLAN                                                                                      
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=6787.71..6849.12 rows=6141 width=8)
   ->  Hash Semi Join  (cost=6403.91..6772.36 rows=6141 width=8)
         Hash Cond: (u.id = public.label_list.user_id)
         ->  Seq Scan on user_master u  (cost=0.00..164.00 rows=10000 width=8)
         ->  Hash  (cost=6327.15..6327.15 rows=6141 width=72)
               ->  Hash Join  (cost=4141.44..6327.15 rows=6141 width=72)
                     Hash Cond: (public.label_list.user_id = public.label_list.user_id)
                     ->  Hash Join  (cost=3670.68..5458.41 rows=53850 width=56)
                           Hash Cond: (public.label_list.user_id = public.label_list.user_id)
                           ->  Hash Join  (cost=3048.05..3818.39 rows=25541 width=48)
                                 Hash Cond: (l0.user_id = public.label_list.user_id)
                                 ->  Seq Scan on label_list l0  (cost=0.00..353.39 rows=21539 width=8)
                                 ->  Hash  (cost=2896.62..2896.62 rows=12114 width=40)
                                       ->  Hash Join  (cost=2287.46..2896.62 rows=12114 width=40)
                                             Hash Cond: (public.label_list.user_id = public.label_list.user_id)
                                             ->  Seq Scan on label_list  (cost=0.00..353.39 rows=21539 width=8)
                                             ->  Hash  (cost=2215.64..2215.64 rows=5746 width=32)
                                                   ->  Hash Join  (cost=1683.63..2215.64 rows=5746 width=32)
                                                         Hash Cond: (public.label_list.user_id = public.label_list.user_id)
                                                         ->  Seq Scan on label_list  (cost=90.75..497.99 rows=10770 width=8)
                                                               Filter: (NOT (hashed SubPlan 2))
                                                               SubPlan 2
                                                                 ->  Index Scan using label_list_label_idx on label_list  (cost=0.00..85.57 rows=2075 width=8)
                                                                       Index Cond: ((label)::text = 'D'::text)
                                                         ->  Hash  (cost=1524.75..1524.75 rows=5450 width=24)
                                                               ->  Hash Join  (cost=962.95..1524.75 rows=5450 width=24)
                                                                     Hash Cond: (public.label_list.user_id = public.label_list.user_id)
                                                                     ->  Hash Join  (cost=420.70..908.14 rows=5450 width=16)
                                                                           Hash Cond: (public.label_list.user_id = public.label_list.user_id)
                                                                           ->  Seq Scan on label_list  (cost=147.43..554.67 rows=10770 width=8)
                                                                                 Filter: (NOT (hashed SubPlan 1))
                                                                                 SubPlan 1
                                                                                   ->  Index Scan using label_list_label_idx on label_list  (cost=0.00..138.79 rows=3459 width=8)
                                                                                         Index Cond: ((label)::text = 'C'::text)
                                                                           ->  Hash  (cost=242.01..242.01 rows=2500 width=8)
                                                                                 ->  HashAggregate  (cost=217.01..242.01 rows=2500 width=8)
                                                                                       ->  Index Scan using label_list_label_idx on label_list  (cost=0.00..204.04 rows=5188 width=8)
                                                                                             Index Cond: ((label)::text = 'B'::text)
                                                                     ->  Hash  (cost=479.76..479.76 rows=4999 width=8)
                                                                           ->  HashAggregate  (cost=429.77..479.76 rows=4999 width=8)
                                                                                 ->  Index Scan using label_list_label_idx on label_list  (cost=0.00..403.83 rows=10376 width=8)
                                                                                       Index Cond: ((label)::text = 'A'::text)
                           ->  Hash  (cost=353.39..353.39 rows=21539 width=8)
                                 ->  Seq Scan on label_list  (cost=0.00..353.39 rows=21539 width=8)
                     ->  Hash  (cost=456.20..456.20 rows=1165 width=16)
                           ->  Hash Semi Join  (cost=27.48..456.20 rows=1165 width=16)
                                 Hash Cond: (public.label_list.user_id = public.label_list.user_id)
                                 ->  Seq Scan on label_list  (cost=0.00..353.39 rows=21539 width=8)
                                 ->  Hash  (cost=21.97..21.97 rows=441 width=8)
                                       ->  Index Scan using label_list_label_idx on label_list  (cost=0.00..21.97 rows=441 width=8)
                                             Index Cond: ((label)::text = 'E'::text)
(51 行)

explainの結果からも分かるように、最終的なコストが、クエリ1-1が「1118416.52」、クエリ1-2が「1247599.82』であるのに対して、クエリ2-1が「2600.26」、クエリ2-2が「6849.12」と激減している。

実のところ、検証環境でクエリを流してみるといずれも1秒以内で返ってくるので差があまり実感できなかったが、この書き換えを組み込んだ実際のクエリで試してみたところ、書き換え前が4秒弱に対して書き換え後が200ミリ秒と約20分の1になっていたので、効果はあるということだろう。問題は、書き換え後のクエリではJOINの嵐になるということである。実際に使う際には、キーワードの数を5個くらいまでに絞るとかそういう制限を設けないときついだろう。

2017-06-04

IntStreamで遊んでみた

Java

暇だったので、Java 8から追加されたjava.util.stream.IntStreamでちょっと遊んでみた。

ちなみに、IntStreamはinterfaceであるが、ソースを見るとstaticメソッドの実装が書いてある。IntStream#ofとか。まぁそれは余談。

以下のようなソースで出力を見ていく。

import java.util.Arrays;
import java.util.stream.IntStream;

public class Main {
    public static void main(String[] args) {
        int[] ary = { 5, 2, 8, 2, 9, 1, 3, 0, 5, 6, 4, 7, };

        System.out.println("--------countTest");
        countTest(ary);
        System.out.println("--------sumTest");
        sumTest(ary);
        System.out.println("--------maxTest");
        maxTest(ary);
        System.out.println("--------minTest");
        minTest(ary);
        System.out.println("--------sortTest");
        sortTest(ary);
        System.out.println("--------1つのstreamで連続処理できるのかテスト");
        test(ary);
        System.out.println("--------パフォーマンスを計ってみる(sumを例に)");
        for (int ct = 1; ct <= 100000000; ct *= 10) {
            performance(ary, ct);
        }
        System.out.println("--------その他");
        othersTest(ary);
        System.out.println("--------classTest");
        classTest(ary);
    }
    private static void countTest(int[] ary) {
        // 配列が手元にある場合に最もポピュラーな手段
        System.out.println("count1=" + ary.length);

        // Streamしか手元にない場合は仕方ない
        IntStream stream = IntStream.of(ary);
        System.out.println("count2=" + stream.count());
    }
    private static void sumTest(int[] ary) {
        // ループ変数を使って要素にアクセス
        int sum1 = 0;
        for (int i = 0; i < ary.length; ++i) {
            sum1 += ary[i];
        }
        System.out.println("sum1=" + sum1);

        // 拡張for文
        int sum2 = 0;
        for (int a : ary) {
            sum2 += a;
        }
        System.out.println("sum2=" + sum2);

        // java.util.Arrays.stream()
        System.out.println("sum3=" + Arrays.stream(ary).sum());

        // java.util.stream.IntStream
        System.out.println("sum4=" + IntStream.of(ary).sum());
    }
    private static void maxTest(int[] ary) {
        // ループ変数を使って要素にアクセス
        int max1 = Integer.MIN_VALUE;
        for (int i = 0; i < ary.length; ++i) {
            max1 = Math.max(max1, ary[i]);
        }
        System.out.println("max1=" + max1);

        // 拡張for文
        int max2 = Integer.MIN_VALUE;
        for (int a : ary) {
            max2 = Math.max(max2, a);
        }
        System.out.println("max2=" + max2);

        // java.util.Arrays.stream()
        System.out.println("max3=" + Arrays.stream(ary).max());

        // java.util.stream.IntStream
        System.out.println("max4=" + IntStream.of(ary).max());
    }
    private static void minTest(int[] ary) {
        // ループ変数を使って要素にアクセス
        int min1 = Integer.MAX_VALUE;
        for (int i = 0; i < ary.length; ++i) {
            min1 = Math.min(min1, ary[i]);
        }
        System.out.println("min1=" + min1);

        // 拡張for文
        int min2 = Integer.MAX_VALUE;
        for (int a : ary) {
            min2 = Math.min(min2, a);
        }
        System.out.println("min2=" + min2);

        // java.util.Arrays.stream()
        System.out.println("min3=" + Arrays.stream(ary).min());

        // java.util.stream.IntStream
        System.out.println("min4=" + IntStream.of(ary).min());
    }
    private static void sortTest(int[] ary) {
        // java.util.Arraysのsortメソッド(副作用あり)
        int[] ary1 = ary.clone();
        Arrays.sort(ary1);
        System.out.println("sort1=" + Arrays.toString(ary1));

        // java.util.stream.IntStreamのソート済みstreamを取得するsortedメソッド(副作用無し)
        IntStream stream2 = IntStream.of(ary).sorted();
        System.out.println("sort2=" + Arrays.toString(stream2.toArray()));
        System.out.println("orig =" + Arrays.toString(ary));
    }
    private static void test(int[] ary) {
        try {
            IntStream stream = IntStream.of(ary);
            System.out.println("sumT=" + stream.sum());
            System.out.println("maxT=" + stream.max());
            System.out.println("minT=" + stream.min());
            System.out.println("sortT=" + Arrays.toString(stream.sorted().toArray()));
        } catch (RuntimeException e) {
            e.printStackTrace();
        }
    }
    private static void performance(int[] ary, int loopCount) {
        // ループ変数を使って要素にアクセス
        long S1 = System.currentTimeMillis();
        for (int ct = 0; ct < loopCount; ++ct) {
            int sum1 = 0;
            for (int i = 0; i < ary.length; ++i) {
                sum1 += ary[i];
            }
        }
        long G1 = System.currentTimeMillis();

        // 拡張for文
        long S2 = System.currentTimeMillis();
        for (int ct = 0; ct < loopCount; ++ct) {
            int sum2 = 0;
            for (int a : ary) {
                sum2 += a;
            }
        }
        long G2 = System.currentTimeMillis();

        // java.util.Arrays.stream()
        long S3 = System.currentTimeMillis();
        for (int ct = 0; ct < loopCount; ++ct) {
            int sum3 = Arrays.stream(ary).sum();
        }
        long G3 = System.currentTimeMillis();

        // java.util.stream.IntStream
        long S4 = System.currentTimeMillis();
        for (int ct = 0; ct < loopCount; ++ct) {
            int sum4 = IntStream.of(ary).sum();
        }
        long G4 = System.currentTimeMillis();

        System.out.printf("sum(ms) %16dct | %5d | %5d | %5d | %5d |%n",
            loopCount, G1 - S1, G2 - S2, G3 - S3, G4 - S4);
    }
    private static void othersTest(int[] ary) {
        System.out.println("orig    =" + Arrays.toString(ary));
        // distinct
        IntStream stream1 = IntStream.of(ary).distinct();
        System.out.println("distinct=" + Arrays.toString(stream1.toArray()));
        // skip
        IntStream stream2 = IntStream.of(ary).skip(1);
        System.out.println("skip    =" + Arrays.toString(stream2.toArray()));
        // limit
        IntStream stream3 = IntStream.of(ary).limit(5);
        System.out.println("limit   =" + Arrays.toString(stream3.toArray()));
    }
    private static void classTest(int[] ary) {
        IntStream stream;

        stream = IntStream.empty();
        System.out.println("empty     =" + stream.getClass().getName());
        stream = IntStream.range(1, 2);
        System.out.println("range     =" + stream.getClass().getName());
        stream = IntStream.concat(IntStream.empty(), IntStream.empty());
        System.out.println("concat    =" + stream.getClass().getName());
        stream = IntStream.of(ary);
        System.out.println("of        =" + stream.getClass().getName());
        stream = IntStream.of(ary).sorted();
        System.out.println("sorted    =" + stream.getClass().getName());
        stream = IntStream.of(ary).filter(i -> true);
        System.out.println("filter    =" + stream.getClass().getName());
        stream = IntStream.of(ary).map(i -> i);
        System.out.println("map       =" + stream.getClass().getName());
        stream = IntStream.of(ary).distinct();
        System.out.println("distinct  =" + stream.getClass().getName());
        stream = IntStream.of(ary).skip(1);
        System.out.println("skip      =" + stream.getClass().getName());
        stream = IntStream.of(ary).limit(1);
        System.out.println("limit     =" + stream.getClass().getName());
    }
}

各々の出力結果

countTest

--------countTest
count1=12
count2=12

int配列が手元にあるなら要らないだろうというメソッドcount()。

Streamしか手元にないなら仕方ないだろうが、Streamの要素を数え上げるという処理によって要素が消費されてその後は何もできなくなる(後述)ので、要素数を取得してStreamを捨ててもよいという場合にしか使えない。

sumTest

--------sumTest
sum1=52
sum2=52
sum3=52
sum4=52

これは、配列の要素のsumを計算するのに1行で書けるのでちょっとうれしいかも。

ちなみに、IntStream.of(int...)は内部でArrays.stream(int[])を呼んでいる。

maxTest

--------maxTest
max1=9
max2=9
max3=OptionalInt[9]
max4=OptionalInt[9]

これは、OptionalIntが返ってくるのがちょっと微妙なところ。max値がInteger.MIN_VALUEかどうかと比較するのとどっちがいいかという感じ。

minTest

--------minTest
min1=0
min2=0
min3=OptionalInt[0]
min4=OptionalInt[0]

maxと同じく。

sortTest

--------sortTest
sort1=[0, 1, 2, 2, 3, 4, 5, 5, 6, 7, 8, 9]
sort2=[0, 1, 2, 2, 3, 4, 5, 5, 6, 7, 8, 9]
orig =[5, 2, 8, 2, 9, 1, 3, 0, 5, 6, 4, 7]

元の配列への副作用を発生させないために、ary.clone()するかIntStream#sortedを使うかという感じ。ソートされたStreamを取得して更にあれこれやりたい場合には有用。

test

--------1つのstreamで連続処理できるのかテスト
sumT=52
java.lang.IllegalStateException: stream has already been operated upon or closed
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:229)
        at java.util.stream.IntPipeline.reduce(IntPipeline.java:461)
        at java.util.stream.IntPipeline.max(IntPipeline.java:424)
        at Main.test(Main.java:111)
        at Main.main(Main.java:19)

これはまぁ予想通り。java.io.InputStreamやjava.io.Readerからの読み込みを(一部を除いて)巻き戻し再実行できないのと同じ。

(補足)sum、max、min、average、countが欲しいだけならsummaryStatistics()メソッドを使えばいいらしい。

performance

--------パフォーマンスを計ってみる(sumを例に)
sum(ms)                1ct |     0 |     0 |     0 |     0 |
sum(ms)               10ct |     0 |     0 |     0 |     0 |
sum(ms)              100ct |     0 |     0 |     1 |     0 |
sum(ms)             1000ct |     0 |     0 |     9 |     6 |
sum(ms)            10000ct |     2 |     2 |     4 |    16 |
sum(ms)           100000ct |     5 |    28 |     8 |    18 |
sum(ms)          1000000ct |    75 |    66 |   102 |    53 |
sum(ms)         10000000ct |     0 |     0 |   649 |   607 |
sum(ms)        100000000ct |     0 |     0 |  6118 |  6154 |

まぁ、IntStreamオブジェクトを生成するから仕方ない。

othersTest

--------その他
orig    =[5, 2, 8, 2, 9, 1, 3, 0, 5, 6, 4, 7]
distinct=[5, 2, 8, 9, 1, 3, 0, 6, 4, 7]
skip    =[2, 8, 2, 9, 1, 3, 0, 5, 6, 4, 7]
limit   =[5, 2, 8, 2, 9]

distinctは重複要素を除去した新たなStreamを生成する。

skipは先頭の指定された要素数をスキップした新たなStreamを生成する。

limitは先頭の指定された要素数だけを返す新たなStreamを生成する。

その他にfilterとかmapとか色々あるが割愛(謎)。lambda式から逃げてるだけとも言わなくもない(ぉ‥

classTest

--------classTest
empty     =java.util.stream.IntPipeline$Head
range     =java.util.stream.IntPipeline$Head
concat    =java.util.stream.IntPipeline$Head
of        =java.util.stream.IntPipeline$Head
sorted    =java.util.stream.SortedOps$OfInt
filter    =java.util.stream.IntPipeline$9
map       =java.util.stream.IntPipeline$3
distinct  =java.util.stream.ReferencePipeline$4
skip      =java.util.stream.SliceOps$2
limit     =java.util.stream.SliceOps$2

java.util.stream.IntStreamがインタフェースだということで、実装クラスがどうなっているのか気になったので列挙してみた。

2017-04-23

PHPの三項演算子の注意すべき挙動

PHP Python Ruby C C++ Java

PHPで以下のようなコードを書いていてしばらくはまっていたのでメモ。

<?php

function a($a) {
    printf("%s\n",
        isset($a["A"]["B"]) ?
            $a["A"]["B"] :
            isset($a["AA"]["BB"]) ?
                $a["AA"]["BB"] :
                "CCC");
}

$a = array();
$a["A"]["B"] = "C";
a($a); // "C" ?

$a = array();
$a["AA"]["BB"] = "CC";
a($a); // "CC"

$a = array();
a($a); // "CCC"

実行結果は以下の通り。

PHP Notice:  Undefined index: AA in boo.php on line 8

CC
CCC

8行目でNoticeが出ることがある。8行目は

                $a["AA"]["BB"] :

の部分‥7行目でissetでチェックしているはずだが‥

ググってみると、以下のようなものが見つかった。

PHP: 比較演算子 - Manual

曰く

注意:
三項演算子を “積み重ねて” 使用することは避けましょう。ひとつの文の中で複数の三項演算子を使用した際の PHP の振る舞いは、少々わかりにくいものです。

例として挙がっているのが以下のコード。

<?php
echo (true?'true':false?'t':'f');

これはPHPでは

<?php
echo((true ? 'true' : false) ? 't' : 'f');

と評価され、’t' が出力されると。

‥ってこんなん分かるかいっ‥ってことで、各言語で試してみた。

PHP

CentOS 7のPHP 5.4.16

<?php
printf("%s\n", (true ? 'true' : false ? 't' : 'f'));
printf("%s\n", (false ? 'false' : true ? 't' : 'f'));
printf("%s\n", (false ? 'false' : false ? 't' : 'f'));

実行結果

t
t
f

Java

CentOS 7のOpenJDK 1.8.0_121

public class Hoge {
    public static void main(String[] args) {
        System.out.println(true ? "true" : false ? "t" : "f");
        System.out.println(false ? "false" : true ? "t" : "f");
        System.out.println(false ? "false" : false ? "t" : "f");
    }
}

実行結果

true
t
f

C

CentOS 7のgcc バージョン 4.8.5

#include <stdio.h>
#include <stdbool.h>

void main(int argc, char** argv) {
    printf("%s\n", true ? "true" : false ? "t" : "f");
    printf("%s\n", false ? "false" : true ? "t" : "f");
    printf("%s\n", false ? "false" : false ? "t" : "f");
}

実行結果

true
t
f

C++

CentOS 7のgcc バージョン 4.8.5

#include <iostream>

using namespace std;

int main(int argc, char** argv) {
    cout << (true ? "true" : false ? "t" : "f") << endl;
    cout << ("%s\n", false ? "false" : true ? "t" : "f") << endl;
    cout << ("%s\n", false ? "false" : false ? "t" : "f") << endl;
    return 0;
}

実行結果

true
t
f

Ruby

CentOS 7のruby 2.0.0p648

printf("%s\n", true ? "true" : false ? "t" : "f");
printf("%s\n", false ? "false" : true ? "t" : "f");
printf("%s\n", false ? "false" : false ? "t" : "f");

実行結果

true
t
f

Python

CentOS 7のPython 2.7.5

print('true' if True else 't' if False else 'f');
print('false' if False else 't' if True else 'f');
print('false' if False else 't' if False else 'f');

実行結果

true
t
f

参考

pythonで三項演算子のネスト - Qiita

結局‥

冒頭のコードは

<?php

function a($a) {
    printf("%s\n",
        isset($a["A"]["B"]) ?
            $a["A"]["B"] :
            (isset($a["AA"]["BB"]) ?
                $a["AA"]["BB"] :
                "CCC"));
}

$a = array();
$a["A"]["B"] = "C";
a($a); // "C"

$a = array();
$a["AA"]["BB"] = "CC";
a($a); // "CC"

$a = array();
a($a); // "CCC"

のように括弧を付けて逃げましたとさ。やれやれ。

2017-02-27

Apache TikaのPDFファイルテキスト抽出で遊んでみる

Apache Tika Java

今更ながらに「Apache Tika」というものの存在を知る‥PDFとかその他諸々のファイルのメタデータやテキストを抽出してくれる‥｜Apache Tika https://t.co/CsCBY74ekK
— HHeLiBeX (@hhelibex) 2017年2月27日

ということで、サイトのParser APIを追いかけてコードを組み立ててみたメモ。

環境は、CentOS 7(VM)上のOpenJDK 1.8.0_111。

PDFファイルからのテキスト抽出

以下のようなPDFファイルを使う。

f:id:hhelibex:20170227225332p:plain

PDFファイルからのテキスト抽出にはorg.apache.tika.parser.pdf.PDFParserクラスを使う。

最低限のコードは以下のような感じ。

import java.io.*;
import java.util.*;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;

import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class ParserSample {
    public static void main(String[] args) {
        Parser parser = new PDFParser();

        for (String arg : args) {
            ContentHandler handler = new BodyContentHandler();
            try (InputStream in = new FileInputStream(arg)) {
                Metadata metadata = new Metadata();
                ParseContext context = new ParseContext();
                try {
                    parser.parse(in, handler, metadata, context);
                } catch (TikaException | IOException | SAXException e) {
                    e.printStackTrace();
                }

                List<String> list = new ArrayList<>(Arrays.asList(handler.toString().trim().split("\\s")));
                List<String> names = new ArrayList<>(Arrays.asList(metadata.names()));
                Collections.sort(names);
                for (String name : names) {
                    System.out.printf("%-48s = %s%n", name, metadata.get(name));
                }
                System.out.println();
                System.out.println(list);
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}

これを実行すると、出力として以下のようなものが得られる。

Author                                           = hhelibex
Content-Type                                     = application/pdf
Creation-Date                                    = 2017-02-27T11:04:23Z
Last-Modified                                    = 2017-02-27T11:04:23Z
Last-Save-Date                                   = 2017-02-27T11:04:23Z
access_permission:assemble_document              = true
access_permission:can_modify                     = true
access_permission:can_print                      = true
access_permission:can_print_degraded             = true
access_permission:extract_content                = true
access_permission:extract_for_accessibility      = true
access_permission:fill_in_form                   = true
access_permission:modify_annotations             = true
created                                          = Mon Feb 27 20:04:23 JST 2017
creator                                          = hhelibex
date                                             = 2017-02-27T11:04:23Z
dc:creator                                       = hhelibex
dc:format                                        = application/pdf; version=1.4
dc:title                                         = Apache Tikaのテスト用
dcterms:created                                  = 2017-02-27T11:04:23Z
dcterms:modified                                 = 2017-02-27T11:04:23Z
meta:author                                      = hhelibex
meta:creation-date                               = 2017-02-27T11:04:23Z
meta:save-date                                   = 2017-02-27T11:04:23Z
modified                                         = 2017-02-27T11:04:23Z
pdf:PDFVersion                                   = 1.4
pdf:docinfo:created                              = 2017-02-27T11:04:23Z
pdf:docinfo:creator                              = hhelibex
pdf:docinfo:creator_tool                         = MicrosoftR Office ExcelR 2007
pdf:docinfo:modified                             = 2017-02-27T11:04:23Z
pdf:docinfo:producer                             = MicrosoftR Office ExcelR 2007
pdf:docinfo:title                                = Apache Tikaのテスト用
pdf:encrypted                                    = false
pdfa:PDFVersion                                  = A-1b
pdfaid:conformance                               = B
pdfaid:part                                      = 1
producer                                         = MicrosoftR Office ExcelR 2007
title                                            = Apache Tikaのテスト用
xmp:CreatorTool                                  = MicrosoftR Office ExcelR 2007
xmpMM:DocumentID                                 = uuid:72AB1A36-B534-4714-90E1-7640E90E7083
xmpTPg:NPages                                    = 1

[ヘッダ１, ヘッダ２, あ, ア, 阿, い, イ, 伊, う, ウ, 宇, え, エ, 江, お, オ, 尾]

PDFファイルだと「pdf:」で始まる名前を使ってMetadataから情報を取ればよい感じか。

コンテンツの方は、handler.toString()で得られる文字列が抽出結果になっているので、あとは煮るなり焼くなりすればいい。

汎用的にしてみる

実は、上記のことはorg.apache.tika.parser.AutoDetectParserを使っても実現できる。

ファイルタイプの検出に少し時間が掛かるのだが、PDFファイルだけではなくほかのファイルのテキストも同じコードで抽出したいという場合には有用だろう。

コードはほぼ同じになるが、以下のような感じ。

import java.io.*;
import java.util.*;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
//import org.apache.tika.parser.pdf.PDFParser;
import org.apache.tika.sax.BodyContentHandler;

import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;

public class ParserSample {
    public static void main(String[] args) {
//      Parser parser = new PDFParser();
        Parser parser = new AutoDetectParser();

        for (String arg : args) {
            ContentHandler handler = new BodyContentHandler();
            try (InputStream in = new FileInputStream(arg)) {
                Metadata metadata = new Metadata();
                ParseContext context = new ParseContext();
                try {
                    parser.parse(in, handler, metadata, context);
                } catch (TikaException | IOException | SAXException e) {
                    e.printStackTrace();
                }

                List<String> list = new ArrayList<>(Arrays.asList(handler.toString().trim().split("\\s")));
                List<String> names = new ArrayList<>(Arrays.asList(metadata.names()));
                Collections.sort(names);
                for (String name : names) {
                    System.out.printf("%-48s = %s%n", name, metadata.get(name));
                }
                System.out.println();
                System.out.println(list);
            } catch (IOException e) {
                e.printStackTrace();
            }
        }
    }
}

試しに、先に使ったPDFファイルの元であるExcelファイルを食わせてみる。

Application-Name                                 = Microsoft Excel
Author                                           = hhelibex
Company                                          = 
Content-Type                                     = application/vnd.ms-excel
Creation-Date                                    = 2017-02-27T10:23:34Z
Last-Author                                      = hhelibex
Last-Modified                                    = 2017-02-27T10:31:13Z
Last-Save-Date                                   = 2017-02-27T10:31:13Z
X-Parsed-By                                      = org.apache.tika.parser.DefaultParser
creator                                          = hhelibex
date                                             = 2017-02-27T10:31:13Z
dc:creator                                       = hhelibex
dc:title                                         = Apache Tikaのテスト用
dcterms:created                                  = 2017-02-27T10:23:34Z
dcterms:modified                                 = 2017-02-27T10:31:13Z
extended-properties:Application                  = Microsoft Excel
extended-properties:Company                      = 
meta:author                                      = hhelibex
meta:creation-date                               = 2017-02-27T10:23:34Z
meta:last-author                                 = hhelibex
meta:save-date                                   = 2017-02-27T10:31:13Z
modified                                         = 2017-02-27T10:31:13Z
title                                            = Apache Tikaのテスト用

[Sheet1, , , ヘッダ１, ヘッダ２, , あ, ア, 阿, , い, イ, 伊, , う, ウ, 宇, , え, エ, 江, , お, オ, 尾, , , Sheet2, , , , , Sheet3]

PDFファイルの場合も同様だが、AutoDetectParserを使うと「X-Parsed-By」がMetadataに追加されている。

2016-12-20

シフト演算とMath.pow(2, n)と（＋Math.pow(2, n)の怪）

Java

自分が時々やらかしてしまうのでメモ。

環境は、CentOS 7(VM)上のOpenJDK 1.8.0_111。

2のべき乗(整数値)が欲しい場合はシフト演算

2のべき乗(整数値)が欲しいときに時々やらかしてしまうのが、以下のようなコードを書いてしまうこと。

for (int i = 0; i < 63; ++i) {
    long tmp = (long)Math.pow(2, i);
}

基数が2以外だったり指数が整数でなかったりdouble値が欲しかったりlongのMAX_VALUEを超える場合だったり、そんな場合は仕方がないのだが、2のべき乗(整数値)が欲しい場合はシフト演算の方が断然速い。

for (int i = 0; i < 63; ++i) {
    long tmp = 1L << i;
}

どれくらい速いのか、実際に計ってみた。

import java.util.Scanner;

public class Main {
    public static void main(String[] args) {
        long S;
        long G;

        int cnt;

        Scanner sc = new Scanner(System.in);
        int n = sc.nextInt();
        int m = sc.nextInt();

        System.out.print(n);

        S = System.currentTimeMillis();
        test1(n, m);
        G = System.currentTimeMillis();
        System.out.print("," + (G - S));

        S = System.currentTimeMillis();
        test2(n, m);
        G = System.currentTimeMillis();
        System.out.print("," + (G - S));

        S = System.currentTimeMillis();
        test3(n, m);
        G = System.currentTimeMillis();
        System.out.print("," + (G - S));

        System.out.println();
    }
    private static void test1(int n, int m) {
        for (int i = 0; i < n; ++i) {
            for (int j = 0; j < 63; ++j) {
                long tmp = 1L << j;
            }
        }
    }
    private static void test2(int n, int m) {
        for (int i = 0; i < n; ++i) {
            for (int j = 0; j < 63; ++j) {
                long tmp = (long)Math.pow(2, j);
            }
        }
    }
    private static void test3(int n, int m) {
        for (int i = 0; i < n; ++i) {
            for (int j = 0; j < 63; ++j) {
                long tmp = (long)Math.pow(2, m);
            }
        }
    }
}

test1が「Math.pow(2, n)」、test2が「シフト演算」を使ったケース。

test3は後で使うので後述。

実行は以下のような感じ。

for i in 100 250 500 1000 2500 5000 10000 25000 50000 100000 250000 500000 1000000 ; do
    echo $i 62 | java Main
done

実行結果のうち、test1とtest2を抜き出したものは以下の通り。単位はミリ秒。

n	test1	test2
100	0	1
250	0	2
500	0	3
1000	1	5
2500	3	16
5000	5	30
10000	9	54
25000	10	121
50000	12	233
100000	13	458
250000	21	1142
500000	31	2260
1000000	54	4513

f:id:hhelibex:20161220232156p:plain

全然違う。

Math.pow(2, n)の怪

検証している最中に妙なことに気付いたのでついでにメモ。

上記プログラムのtest1とtest3を使う。

test3のポイントは、「Math.pow(2, m)」とループの最中に変わることがない値を指数に指定していること。

実行結果のうち、test1とtest3を抜き出してみる。単位はミリ秒。

n	test1	test3
100	0	0
250	0	1
500	0	3
1000	1	8
2500	3	18
5000	5	31
10000	9	37
25000	10	37
50000	12	40
100000	13	45
250000	21	59
500000	31	83
1000000	54	130

f:id:hhelibex:20161220232701p:plain

それでもシフト演算の方が速いのだが、test2との違いは何だろう‥ソースを追おうにもnativeメソッドだし‥謎だ‥

2016-12-02

rangeによるrangeの違い

PHP Python Java Ruby

Pythonのコードをちゃんと書いたことはないのだけど、ある理由でPythonのコードを読んでいて、ふと気になって調べたメモ。

結論から言うと、rangeで範囲を指定したときに列挙される値が、プログラム言語や、同じ言語でも書き方によって違うのだな、と。

実行環境

CentOS 7
- Python: 2.7.5
- PHP: 5.4.16
- Java: OpenJDK 1.8.0_111
- Ruby: 2.0.0p598

Python

以下のコードを実行する。

range.py

for i in range(1, 2):
    print(i)

実行結果。

つまり、for (i = 1; i < 2; ++i)(疑似コード)って書くのと一緒。

PHP

以下のコードを実行する。

range.php

<?php
foreach (range(1, 2) as $i) {
    printf("%d\n", $i);
}

実行結果。

1
2

つまり、for (i = 1; i <= 2; ++i)(疑似コード)って書くのと一緒。

Java

ついでなので調べてみたら、Java 8からjava.util.stream.IntStreamというのが追加されていた。

IntStreamには、range(int, int)とrangeClosed(int, int)の2種類がある。

forEachメソッドのところは適当にググって真似てみた(謎)。

Range.java

import java.util.stream.IntStream;

public class Range {
    public static void main(String[] args) {
        IntStream.range(1, 2).forEach(i -> {
            System.out.println(i);
        });
    }
}

実行結果。

こっちはPython式。

RangeClosed.java

import java.util.stream.IntStream;

public class RangeClosed {
    public static void main(String[] args) {
        IntStream.rangeClosed(1, 2).forEach(i -> {
            System.out.println(i);
        });
    }
}

実行結果。

1
2

こっちはPHP式。

Ruby

さらについで。Rubyのプログラムもちゃんと書いたことはないのでググって調べてみた。

range.rb

for i in 1...2 do
    print(i, "\n")
end

または、

for i in Range.new(1, 2, true) do
    print(i, "\n")
end

実行結果。

Python式。

rangeClosed.rb

for i in 1..2 do
    print(i, "\n")
end

または、

for i in Range.new(1, 2) do
    print(i, "\n")
end

実行結果。

1
2

PHP式。

まとめると・・・

言語	プログラムの記述例	実際の範囲
Python	`for i in range(1, 100):`	1 <= i < 100
PHP	`foreach (range(1, 100) as $i)`	1 <= $i <= 100
Java	`IntStream.range(1, 100).forEach(i (略)`	1 <= i < 100
	`IntStream.rangeClosed(1, 100).forEach(i (略)`	1 <= i <= 100
Ruby	`for i in 1..100 do`	1 <= i <= 100
	`for i in Range.new(1, 100) do`	1 <= i <= 100
	`for i in 1...100 do`	1 <= i < 100
	`for i in Range.new(1, 100, true) do`	1 <= i < 100