MySQL 8.0 hash join

发表于： 2020-07-23 15:36:59
分类： MySQL 8.0 新特性
浏览：浏览(689)

MySQL 8.0.18 版本引入了 hash join 功能，来优化没有走索引的等值 join 连接，hash join 通常比 BNL(Block Nested-Loop) 算法效率更高。比如一个常见的等值 join 查询语句，如下：

SELECT *
    FROM t1
    JOIN t2
        ON t1.c1=t2.c1;

一、8.0.20 不加 format=tree 也能显示 hash join

在 8.0.20 版本之前，需要在 explain 语句加上 format=tree 才能显示使用了 hash join，从 8.0.20 版本，explain 默认就可以显示出使用 hash join，explain analyze 同样也会显示 hash join 相关信息，如下所示：

mysql> EXPLAIN
    -> SELECT * FROM t1
    ->     JOIN t2 ON t1.c1=t2.c1\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: t1
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 1
     filtered: 100.00
        Extra: NULL
*************************** 2. row ***************************
           id: 1
  select_type: SIMPLE
        table: t2
   partitions: NULL
         type: ALL
possible_keys: NULL
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 1
     filtered: 100.00
        Extra: Using where; Using join buffer (hash join)

二、8.0.20 hash join 等值条件的限制被去除

8.0.18 版本使用 hash join 优化的前提条件是两个表 join 连接，至少有一个等值条件，如果没有等值条件，则不能使用 hash join，比如下面这个示例，join 连接的两个表，都有至少一个等值条件：

ELECT * FROM t1
    JOIN t2 ON (t1.c1 = t2.c1 AND t1.c2 < t2.c2)
    JOIN t3 ON (t2.c1 = t3.c1);

从 8.0.20 版本开始，join 的两个表，如果没有使用等值条件，也可以做 hash join，BNL 的支持被移除，之前使用 BNL 的地方将由 hash join 代替，如下：

mysql> EXPLAIN FORMAT=TREE
    -> SELECT * FROM t1
    ->     JOIN t2 ON (t1.c1 = t2.c1)
    ->     JOIN t3 ON (t2.c1 < t3.c1)\G
*************************** 1. row ***************************
EXPLAIN: -> Filter: (t1.c1 < t3.c1)  (cost=1.05 rows=1)
    -> Inner hash join (no condition)  (cost=1.05 rows=1)
        -> Table scan on t3  (cost=0.35 rows=1)
        -> Hash
            -> Inner hash join (t2.c1 = t1.c1)  (cost=0.70 rows=1)
                -> Table scan on t2  (cost=0.35 rows=1)
                -> Hash
                    -> Table scan on t1  (cost=0.35 rows=1)

在 8.0.20 及之后的版本，hash join 去除了等值连接的限制条件，使得下面这些原来不能使用 hash join 优化的类型，在 8.0.20 及之后的版本，都可以使用 hash join 优化了。

1. 笛卡尔连接使用 hash join：
对于没有指定 join 条件的笛卡尔连接，同样可以使用 hash join，如下：

mysql> EXPLAIN FORMAT=TREE
    -> SELECT *
    ->     FROM t1
    ->     JOIN t2
    ->     WHERE t1.c2 > 50\G
*************************** 1. row ***************************
EXPLAIN: -> Inner hash join  (cost=0.70 rows=1)
    -> Table scan on t2  (cost=0.35 rows=1)
    -> Hash
        -> Filter: (t1.c2 > 50)  (cost=0.35 rows=1)
            -> Table scan on t1  (cost=0.35 rows=1)

2. Inner non-equi-join：

mysql> EXPLAIN FORMAT=TREE SELECT * FROM t1 JOIN t2 ON t1.c1 < t2.c1\G
*************************** 1. row ***************************
EXPLAIN: -> Filter: (t1.c1 < t2.c1)  (cost=4.70 rows=12)
    -> Inner hash join (no condition)  (cost=4.70 rows=12)
        -> Table scan on t2  (cost=0.08 rows=6)
        -> Hash
            -> Table scan on t1  (cost=0.85 rows=6)

3. Semijoin：

mysql> EXPLAIN FORMAT=TREE SELECT * FROM t1 
    ->     WHERE t1.c1 IN (SELECT t2.c2 FROM t2)\G
*************************** 1. row ***************************
EXPLAIN: -> Nested loop inner join
    -> Filter: (t1.c1 is not null)  (cost=0.85 rows=6)
        -> Table scan on t1  (cost=0.85 rows=6)
    -> Single-row index lookup on <subquery2> using <auto_distinct_key> (c2=t1.c1)
        -> Materialize with deduplication
            -> Filter: (t2.c2 is not null)  (cost=0.85 rows=6)
                -> Table scan on t2  (cost=0.85 rows=6)

4. Antijoin：

mysql> EXPLAIN FORMAT=TREE SELECT * FROM t2 
    ->     WHERE NOT EXISTS (SELECT * FROM t1 WHERE t1.col1 = t2.col1)\G
*************************** 1. row ***************************
EXPLAIN: -> Nested loop antijoin
    -> Table scan on t2  (cost=0.85 rows=6)
    -> Single-row index lookup on <subquery2> using <auto_distinct_key> (c1=t2.c1)
        -> Materialize with deduplication
            -> Filter: (t1.c1 is not null)  (cost=0.85 rows=6)
                -> Table scan on t1  (cost=0.85 rows=6)

5. Left outer join：

mysql> EXPLAIN FORMAT=TREE SELECT * FROM t1 LEFT JOIN t2 ON t1.c1 = t2.c1\G
*************************** 1. row ***************************
EXPLAIN: -> Left hash join (t2.c1 = t1.c1)  (cost=3.99 rows=36)
    -> Table scan on t1  (cost=0.85 rows=6)
    -> Hash
        -> Table scan on t2  (cost=0.14 rows=6)

6. Right outer join（MySQL 把所有右连接改写为左连接）：

mysql> EXPLAIN FORMAT=TREE SELECT * FROM t1 RIGHT JOIN t2 ON t1.c1 = t2.c1\G
*************************** 1. row ***************************
EXPLAIN: -> Left hash join (t1.c1 = t2.c1)  (cost=3.99 rows=36)
    -> Table scan on t2  (cost=0.85 rows=6)
    -> Hash
        -> Table scan on t1  (cost=0.14 rows=6)

三、hash join 参数

8.0.18 版本在参数 optimizer_switch 中引入选项 hash_join=on/off，来控制是否使用 hash join 优化，同样也可以在 SQL 中使用标记 HASH_JOIN，NO_HASH_JOIN 在会话级达到同样的目的。但是从 8.0.19 版本开始，这些参数和设置将不再生效，之前使用 BNL 算法的地方，全部由 hash join 代替。。

hash join 的内存使用通过参数 join_buffer_size 来控制，默认为 256KB，hash join 使用的内存大小不能超过该参数设置的值，当 hash join 需要的内存大小超过该值时，将会使用磁盘文件来存储，磁盘文件的创建会受参数 open_files_limit 限制，因此适当调大上述两个参数，将有助于 hash join 执行更高效。

从 8.0.18 版本开始，hash join 使用的内存 buffer 是逐步增加的，因此可以将 join_buffer_size 调大，对于小的查询，并不会一次就把所有 join buffer 申请完，所以不用担心内存占用过大的问题。

赞(1)
踩(0)
打赏

文章评论

共0条评论