HHeLiBeXの日記 正道編

日々の記憶の記録とメモ‥

ARRAY_TO_STRINGの結果に対してLIKE検索している部分が遅い問題の一つの解決案

割と複雑なクエリで4秒とか掛かるものにぶち当たり、explainを取ってみたところ、表題の通りARRAY_TO_STRINGした結果得られる文字列に対するLIKE検索をしている部分でコストが増大していることが分かり、軽く検証してみたメモ。

細かいところは追い追い書いていくとして、まずは準備。

環境

準備

以下のようなクエリを流して、テーブルおよびデータを作る。結構時間が掛かるので、実行する際には心してお待ちください。

#! /bin/bash

psql -U postgres -q -c 'DROP DATABASE IF EXISTS test_db'
psql -U postgres -q -c 'CREATE DATABASE test_db'
psql -U postgres -q -c 'CREATE TABLE IF NOT EXISTS user_master (
                              id BIGSERIAL PRIMARY KEY
                            , name VARCHAR(128) NOT NULL
                        )' test_db
psql -U postgres -q -c 'CREATE TABLE IF NOT EXISTS label_list (
                              id BIGSERIAL PRIMARY KEY
                            , user_id BIGINT NOT NULL
                            , label VARCHAR(128) NOT NULL
                        )' test_db
psql -U postgres -q -c 'CREATE INDEX ON label_list(user_id)' test_db
psql -U postgres -q -c 'CREATE INDEX ON label_list(label)' test_db

# 全ユーザ
for ((i = 1; i <= 10000; ++i)); do
        psql -U postgres -q -c "INSERT INTO user_master(name) VALUES('name$(printf %06d ${i})')" test_db
        psql -U postgres -q -c "INSERT INTO label_list(user_id, label) VALUES(${i}, 'A')" test_db
done
# 奇数IDのユーザ
for ((i = 1; i <= 10000; i += 2)); do
        psql -U postgres -q -c "INSERT INTO label_list(user_id, label) VALUES(${i}, 'B')" test_db
done
# 3の倍数+1のIDのユーザ
for ((i = 1; i <= 10000; i += 3)); do
        psql -U postgres -q -c "INSERT INTO label_list(user_id, label) VALUES(${i}, 'C')" test_db
done
# 5の倍数+1のIDのユーザ
for ((i = 1; i <= 10000; i += 5)); do
        psql -U postgres -q -c "INSERT INTO label_list(user_id, label) VALUES(${i}, 'D')" test_db
done
# 素数IDのユーザ
for ((i = 1; i <= 10000; ++i)); do
        if [ $(factor ${i} | wc -w) -eq 2 ]; then
                psql -U postgres -q -c "INSERT INTO label_list(user_id, label) VALUES(${i}, 'E')" test_db
        fi
done

説明は・・要るかな・・一応軽く書くと、user_masterテーブルに各ユーザの情報が格納されており、各ユーザに対して付けられているいくつかのラベルがlabel_listテーブルに1ラベル1行として格納されている、という構造。

クエリ1(書き換え前)

オリジナルのクエリからARRAY_TO_STRINGを使って検索している部分を抜き出したもの。

  • クエリ1-1
SELECT DISTINCT u.id
    FROM user_master u
    WHERE u.id IN (
        SELECT user_id
            FROM (
                SELECT DISTINCT
                      user_id
                    , ARRAY_TO_STRING(
                          ARRAY(
                            SELECT label
                                FROM label_list l2
                                WHERE l1.user_id = l2.user_id
                          )
                        , ',') label_str
                    FROM label_list l1
            ) l0
            WHERE label_str LIKE '%A%'
              AND label_str LIKE '%B%'
              AND label_str LIKE '%E%'
)
  • クエリ1-2
SELECT DISTINCT u.id
    FROM user_master u
    WHERE u.id IN (
        SELECT user_id
            FROM (
                SELECT DISTINCT
                      user_id
                    , ARRAY_TO_STRING(
                          ARRAY(
                            SELECT label
                                FROM label_list l2
                                WHERE l1.user_id = l2.user_id
                          )
                        , ',') label_str
                    FROM label_list l1
            ) l0
            WHERE label_str LIKE '%A%'
              AND label_str LIKE '%B%'
              AND label_str NOT LIKE '%C%'
              AND label_str NOT LIKE '%D%'
              AND label_str LIKE '%E%'
)

いずれも、ARRAY_TO_STRINGであるユーザにつけられているラベルをカンマ区切りの1つの文字列にしてLIKE検索をしている。

2つのクエリの違いはどのラベルを含むか含まないかの条件の数だけ。

クエリ2(書き換え後)

ARRAY_TO_STRINGを使わないようにするには、検索したいキーワードの数だけlabel_listをキーワードで絞り込んでJOINしていくしかない。

  • クエリ2-1
SELECT DISTINCT u.id, u.name
    FROM user_master u
    WHERE u.id IN (
        SELECT l0.user_id
            FROM label_list l0
            JOIN (
                SELECT user_id
                    FROM label_list
                    WHERE user_id IN (
                        SELECT user_id FROM label_list WHERE label = 'A'
                    )
            ) l1
            ON l0.user_id = l1.user_id
            JOIN (
                SELECT user_id
                    FROM label_list
                    WHERE user_id IN (
                        SELECT user_id FROM label_list WHERE label = 'B'
                    )
            ) l2
            ON l0.user_id = l2.user_id
            JOIN (
                SELECT user_id
                    FROM label_list
                    WHERE user_id IN (
                        SELECT user_id FROM label_list WHERE label = 'E'
                    )
            ) l3
            ON l0.user_id = l3.user_id
)
  • クエリ2-2
SELECT DISTINCT u.id, u.name
    FROM user_master u
    WHERE u.id IN (
        SELECT l0.user_id
            FROM label_list l0
            JOIN (
                SELECT user_id
                    FROM label_list
                    WHERE user_id IN (
                        SELECT user_id FROM label_list WHERE label = 'A'
                    )
            ) l1
            ON l0.user_id = l1.user_id
            JOIN (
                SELECT user_id
                    FROM label_list
                    WHERE user_id IN (
                        SELECT user_id FROM label_list WHERE label = 'B'
                    )
            ) l2
            ON l0.user_id = l2.user_id
            JOIN (
                SELECT user_id
                    FROM label_list
                    WHERE user_id NOT IN (
                        SELECT user_id FROM label_list WHERE label = 'C'
                    )
            ) l3
            ON l0.user_id = l3.user_id
            JOIN (
                SELECT user_id
                    FROM label_list
                    WHERE user_id NOT IN (
                        SELECT user_id FROM label_list WHERE label = 'D'
                    )
            ) l4
            ON l0.user_id = l4.user_id
            JOIN (
                SELECT user_id
                    FROM label_list
                    WHERE user_id IN (
                        SELECT user_id FROM label_list WHERE label = 'E'
                    )
            ) l5
            ON l0.user_id = l5.user_id
)

実行結果

全部を載せるのは無理があるので、件数だけ載せておく。

$ cat query1.sql query2.sql | psql -U postgres test_db | grep '^[(]'
(1228)
(463)
(1228)
(463)

実際には、両方のクエリで結果が同じになるということは検証してある。

explainの結果

最後にexplainを取った結果を載せておく。

  • クエリ1-1
                                                                                                          QUERY PLAN                                                                                                          
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=1118038.76..1118416.52 rows=5000 width=8)
   ->  Merge Join  (cost=1118038.76..1118404.02 rows=5000 width=8)
         Merge Cond: (u.id = l1.user_id)
         ->  Index Only Scan using user_master_pkey on user_master u  (cost=0.00..337.25 rows=10000 width=8)
         ->  Sort  (cost=1118038.76..1118039.26 rows=200 width=8)
               Sort Key: l1.user_id
               ->  HashAggregate  (cost=1118029.12..1118031.12 rows=200 width=8)
                     ->  HashAggregate  (cost=997726.56..1117899.45 rows=10374 width=8)
                           ->  Seq Scan on label_list l1  (cost=0.00..997618.90 rows=21533 width=8)
                                 Filter: ((array_to_string((SubPlan 2), ','::text) ~~ '%A%'::text) AND (array_to_string((SubPlan 3), ','::text) ~~ '%B%'::text) AND (array_to_string((SubPlan 4), ','::text) ~~ '%E%'::text))
                                 SubPlan 1
                                   ->  Bitmap Heap Scan on label_list l2  (cost=4.27..11.57 rows=2 width=2)
                                         Recheck Cond: (l1.user_id = user_id)
                                         ->  Bitmap Index Scan on label_list_user_id_idx  (cost=0.00..4.27 rows=2 width=0)
                                               Index Cond: (l1.user_id = user_id)
                                 SubPlan 2
                                   ->  Bitmap Heap Scan on label_list l2  (cost=4.27..11.57 rows=2 width=2)
                                         Recheck Cond: (l1.user_id = user_id)
                                         ->  Bitmap Index Scan on label_list_user_id_idx  (cost=0.00..4.27 rows=2 width=0)
                                               Index Cond: (l1.user_id = user_id)
                                 SubPlan 3
                                   ->  Bitmap Heap Scan on label_list l2  (cost=4.27..11.57 rows=2 width=2)
                                         Recheck Cond: (l1.user_id = user_id)
                                         ->  Bitmap Index Scan on label_list_user_id_idx  (cost=0.00..4.27 rows=2 width=0)
                                               Index Cond: (l1.user_id = user_id)
                                 SubPlan 4
                                   ->  Bitmap Heap Scan on label_list l2  (cost=4.27..11.57 rows=2 width=2)
                                         Recheck Cond: (l1.user_id = user_id)
                                         ->  Bitmap Index Scan on label_list_user_id_idx  (cost=0.00..4.27 rows=2 width=0)
                                               Index Cond: (l1.user_id = user_id)
(30 行)
  • クエリ1-2
                                                                                                                                                                        QUERY PLAN                                                                                                                                                                        
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=1247100.97..1247599.82 rows=1 width=8)
   ->  Nested Loop Semi Join  (cost=1247100.97..1247599.82 rows=1 width=8)
         Join Filter: (u.id = l0.user_id)
         ->  Index Only Scan using user_master_pkey on user_master u  (cost=0.00..337.25 rows=10000 width=8)
         ->  Materialize  (cost=1247100.97..1247112.57 rows=1 width=8)
               ->  Subquery Scan on l0  (cost=1247100.97..1247112.56 rows=1 width=8)
                     ->  HashAggregate  (cost=1247100.97..1247112.55 rows=1 width=8)
                           ->  Seq Scan on label_list l1  (cost=0.00..1247100.97 rows=1 width=8)
                                 Filter: ((array_to_string((SubPlan 2), ','::text) ~~ '%A%'::text) AND (array_to_string((SubPlan 3), ','::text) ~~ '%B%'::text) AND (array_to_string((SubPlan 4), ','::text) !~~ '%C%'::text) AND (array_to_string((SubPlan 5), ','::text) !~~ '%D%'::text) AND (array_to_string((SubPlan 6), ','::text) ~~ '%E%'::text))
                                 SubPlan 1
                                   ->  Bitmap Heap Scan on label_list l2  (cost=4.27..11.57 rows=2 width=2)
                                         Recheck Cond: (l1.user_id = user_id)
                                         ->  Bitmap Index Scan on label_list_user_id_idx  (cost=0.00..4.27 rows=2 width=0)
                                               Index Cond: (l1.user_id = user_id)
                                 SubPlan 2
                                   ->  Bitmap Heap Scan on label_list l2  (cost=4.27..11.57 rows=2 width=2)
                                         Recheck Cond: (l1.user_id = user_id)
                                         ->  Bitmap Index Scan on label_list_user_id_idx  (cost=0.00..4.27 rows=2 width=0)
                                               Index Cond: (l1.user_id = user_id)
                                 SubPlan 3
                                   ->  Bitmap Heap Scan on label_list l2  (cost=4.27..11.57 rows=2 width=2)
                                         Recheck Cond: (l1.user_id = user_id)
                                         ->  Bitmap Index Scan on label_list_user_id_idx  (cost=0.00..4.27 rows=2 width=0)
                                               Index Cond: (l1.user_id = user_id)
                                 SubPlan 4
                                   ->  Bitmap Heap Scan on label_list l2  (cost=4.27..11.57 rows=2 width=2)
                                         Recheck Cond: (l1.user_id = user_id)
                                         ->  Bitmap Index Scan on label_list_user_id_idx  (cost=0.00..4.27 rows=2 width=0)
                                               Index Cond: (l1.user_id = user_id)
                                 SubPlan 5
                                   ->  Bitmap Heap Scan on label_list l2  (cost=4.27..11.57 rows=2 width=2)
                                         Recheck Cond: (l1.user_id = user_id)
                                         ->  Bitmap Index Scan on label_list_user_id_idx  (cost=0.00..4.27 rows=2 width=0)
                                               Index Cond: (l1.user_id = user_id)
                                 SubPlan 6
                                   ->  Bitmap Heap Scan on label_list l2  (cost=4.27..11.57 rows=2 width=2)
                                         Recheck Cond: (l1.user_id = user_id)
                                         ->  Bitmap Index Scan on label_list_user_id_idx  (cost=0.00..4.27 rows=2 width=0)
                                               Index Cond: (l1.user_id = user_id)
(39 行)
  • クエリ2-1
                                                                            QUERY PLAN                                                                            
------------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=2545.00..2600.26 rows=5526 width=8)
   ->  Hash Semi Join  (cost=2194.33..2531.18 rows=5526 width=8)
         Hash Cond: (u.id = public.label_list.user_id)
         ->  Seq Scan on user_master u  (cost=0.00..164.00 rows=10000 width=8)
         ->  Hash  (cost=2125.25..2125.25 rows=5526 width=56)
               ->  Hash Join  (cost=1635.83..2125.25 rows=5526 width=56)
                     Hash Cond: (l0.user_id = public.label_list.user_id)
                     ->  Seq Scan on label_list l0  (cost=0.00..353.39 rows=21539 width=8)
                     ->  Hash  (cost=1603.07..1603.07 rows=2621 width=48)
                           ->  Nested Loop  (cost=301.13..1603.07 rows=2621 width=48)
                                 ->  Nested Loop  (cost=301.13..1162.32 rows=1243 width=40)
                                       ->  Nested Loop Semi Join  (cost=301.13..953.11 rows=590 width=32)
                                             ->  Hash Join  (cost=301.13..748.45 rows=590 width=24)
                                                   Hash Cond: (public.label_list.user_id = public.label_list.user_id)
                                                   ->  Hash Join  (cost=27.87..466.52 rows=1165 width=16)
                                                         Hash Cond: (public.label_list.user_id = public.label_list.user_id)
                                                         ->  Seq Scan on label_list  (cost=0.00..353.39 rows=21539 width=8)
                                                         ->  Hash  (cost=25.20..25.20 rows=213 width=8)
                                                               ->  HashAggregate  (cost=23.07..25.20 rows=213 width=8)
                                                                     ->  Index Scan using label_list_label_idx on label_list  (cost=0.00..21.97 rows=441 width=8)
                                                                           Index Cond: ((label)::text = 'E'::text)
                                                   ->  Hash  (cost=242.01..242.01 rows=2500 width=8)
                                                         ->  HashAggregate  (cost=217.01..242.01 rows=2500 width=8)
                                                               ->  Index Scan using label_list_label_idx on label_list  (cost=0.00..204.04 rows=5188 width=8)
                                                                     Index Cond: ((label)::text = 'B'::text)
                                             ->  Index Scan using label_list_user_id_idx on label_list  (cost=0.00..0.34 rows=1 width=8)
                                                   Index Cond: (user_id = public.label_list.user_id)
                                                   Filter: ((label)::text = 'A'::text)
                                       ->  Index Only Scan using label_list_user_id_idx on label_list  (cost=0.00..0.33 rows=2 width=8)
                                             Index Cond: (user_id = public.label_list.user_id)
                                 ->  Index Only Scan using label_list_user_id_idx on label_list  (cost=0.00..0.33 rows=2 width=8)
                                       Index Cond: (user_id = public.label_list.user_id)
(32 行)
  • クエリ2-2
                                                                                      QUERY PLAN                                                                                      
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 HashAggregate  (cost=6787.71..6849.12 rows=6141 width=8)
   ->  Hash Semi Join  (cost=6403.91..6772.36 rows=6141 width=8)
         Hash Cond: (u.id = public.label_list.user_id)
         ->  Seq Scan on user_master u  (cost=0.00..164.00 rows=10000 width=8)
         ->  Hash  (cost=6327.15..6327.15 rows=6141 width=72)
               ->  Hash Join  (cost=4141.44..6327.15 rows=6141 width=72)
                     Hash Cond: (public.label_list.user_id = public.label_list.user_id)
                     ->  Hash Join  (cost=3670.68..5458.41 rows=53850 width=56)
                           Hash Cond: (public.label_list.user_id = public.label_list.user_id)
                           ->  Hash Join  (cost=3048.05..3818.39 rows=25541 width=48)
                                 Hash Cond: (l0.user_id = public.label_list.user_id)
                                 ->  Seq Scan on label_list l0  (cost=0.00..353.39 rows=21539 width=8)
                                 ->  Hash  (cost=2896.62..2896.62 rows=12114 width=40)
                                       ->  Hash Join  (cost=2287.46..2896.62 rows=12114 width=40)
                                             Hash Cond: (public.label_list.user_id = public.label_list.user_id)
                                             ->  Seq Scan on label_list  (cost=0.00..353.39 rows=21539 width=8)
                                             ->  Hash  (cost=2215.64..2215.64 rows=5746 width=32)
                                                   ->  Hash Join  (cost=1683.63..2215.64 rows=5746 width=32)
                                                         Hash Cond: (public.label_list.user_id = public.label_list.user_id)
                                                         ->  Seq Scan on label_list  (cost=90.75..497.99 rows=10770 width=8)
                                                               Filter: (NOT (hashed SubPlan 2))
                                                               SubPlan 2
                                                                 ->  Index Scan using label_list_label_idx on label_list  (cost=0.00..85.57 rows=2075 width=8)
                                                                       Index Cond: ((label)::text = 'D'::text)
                                                         ->  Hash  (cost=1524.75..1524.75 rows=5450 width=24)
                                                               ->  Hash Join  (cost=962.95..1524.75 rows=5450 width=24)
                                                                     Hash Cond: (public.label_list.user_id = public.label_list.user_id)
                                                                     ->  Hash Join  (cost=420.70..908.14 rows=5450 width=16)
                                                                           Hash Cond: (public.label_list.user_id = public.label_list.user_id)
                                                                           ->  Seq Scan on label_list  (cost=147.43..554.67 rows=10770 width=8)
                                                                                 Filter: (NOT (hashed SubPlan 1))
                                                                                 SubPlan 1
                                                                                   ->  Index Scan using label_list_label_idx on label_list  (cost=0.00..138.79 rows=3459 width=8)
                                                                                         Index Cond: ((label)::text = 'C'::text)
                                                                           ->  Hash  (cost=242.01..242.01 rows=2500 width=8)
                                                                                 ->  HashAggregate  (cost=217.01..242.01 rows=2500 width=8)
                                                                                       ->  Index Scan using label_list_label_idx on label_list  (cost=0.00..204.04 rows=5188 width=8)
                                                                                             Index Cond: ((label)::text = 'B'::text)
                                                                     ->  Hash  (cost=479.76..479.76 rows=4999 width=8)
                                                                           ->  HashAggregate  (cost=429.77..479.76 rows=4999 width=8)
                                                                                 ->  Index Scan using label_list_label_idx on label_list  (cost=0.00..403.83 rows=10376 width=8)
                                                                                       Index Cond: ((label)::text = 'A'::text)
                           ->  Hash  (cost=353.39..353.39 rows=21539 width=8)
                                 ->  Seq Scan on label_list  (cost=0.00..353.39 rows=21539 width=8)
                     ->  Hash  (cost=456.20..456.20 rows=1165 width=16)
                           ->  Hash Semi Join  (cost=27.48..456.20 rows=1165 width=16)
                                 Hash Cond: (public.label_list.user_id = public.label_list.user_id)
                                 ->  Seq Scan on label_list  (cost=0.00..353.39 rows=21539 width=8)
                                 ->  Hash  (cost=21.97..21.97 rows=441 width=8)
                                       ->  Index Scan using label_list_label_idx on label_list  (cost=0.00..21.97 rows=441 width=8)
                                             Index Cond: ((label)::text = 'E'::text)
(51 行)

explainの結果からも分かるように、最終的なコストが、クエリ1-1が「1118416.52」、クエリ1-2が「1247599.82』であるのに対して、クエリ2-1が「2600.26」、クエリ2-2が「6849.12」と激減している。

実のところ、検証環境でクエリを流してみるといずれも1秒以内で返ってくるので差があまり実感できなかったが、この書き換えを組み込んだ実際のクエリで試してみたところ、書き換え前が4秒弱に対して書き換え後が200ミリ秒と約20分の1になっていたので、効果はあるということだろう。問題は、書き換え後のクエリではJOINの嵐になるということである。実際に使う際には、キーワードの数を5個くらいまでに絞るとかそういう制限を設けないときついだろう。