mysql5.6.9rc 子查询优化之后带来的问题
测试数据如下:
表t1和t2通过t1_id关联在一起;
t1中只有一条记录的type为0,另外三条记录的type为1;
对于t1中的每个t1_id, t2中都有6条记录与其关联。
CREATE TABLE `t1` ( `t1_id` INT(10) UNSIGNED NOT NULL AUTO_INCREMENT, `type` INT(10) UNSIGNED NOT NULL DEFAULT '0', PRIMARY KEY (`t1_id`), KEY `type` (`type`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8; CREATE TABLE `t2` ( `t2_id` INT(10) UNSIGNED NOT NULL AUTO_INCREMENT, `t1_id` INT(10) UNSIGNED NOT NULL DEFAULT '0', PRIMARY KEY (`t2_id`), KEY `t1_id` (`t1_id`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8; INSERT INTO t1 SET TYPE=0; INSERT INTO t1 SET TYPE=1; INSERT INTO t1 SET TYPE=1; INSERT INTO t1 SET TYPE=1; INSERT INTO t2 SET t1_id=1; INSERT INTO t2 SET t1_id=1; INSERT INTO t2 SET t1_id=1; INSERT INTO t2 SET t1_id=1; INSERT INTO t2 SET t1_id=1; INSERT INTO t2 SET t1_id=1; INSERT INTO t2 SET t1_id=2; INSERT INTO t2 SET t1_id=2; INSERT INTO t2 SET t1_id=2; INSERT INTO t2 SET t1_id=2; INSERT INTO t2 SET t1_id=2; INSERT INTO t2 SET t1_id=2; INSERT INTO t2 SET t1_id=3; INSERT INTO t2 SET t1_id=3; INSERT INTO t2 SET t1_id=3; INSERT INTO t2 SET t1_id=3; INSERT INTO t2 SET t1_id=3; INSERT INTO t2 SET t1_id=3; INSERT INTO t2 SET t1_id=4; INSERT INTO t2 SET t1_id=4; INSERT INTO t2 SET t1_id=4; INSERT INTO t2 SET t1_id=4; INSERT INTO t2 SET t1_id=4; INSERT INTO t2 SET t1_id=4; mysql> SELECT * FROM t1; +-------+------+ | t1_id | TYPE | +-------+------+ | 1 | 0 | | 2 | 1 | | 3 | 1 | | 4 | 1 | +-------+------+ 4 ROWS IN SET (0.00 sec) mysql> SELECT * FROM t2; +-------+-------+ | t2_id | t1_id | +-------+-------+ | 1 | 1 | | 2 | 1 | | 3 | 1 | | 4 | 1 | | 5 | 1 | | 6 | 1 | | 7 | 2 | | 8 | 2 | | 9 | 2 | | 10 | 2 | | 11 | 2 | | 12 | 2 | | 13 | 3 | | 14 | 3 | | 15 | 3 | | 16 | 3 | | 17 | 3 | | 18 | 3 | | 19 | 4 | | 20 | 4 | | 21 | 4 | | 22 | 4 | | 23 | 4 | | 24 | 4 | +-------+-------+ 24 ROWS IN SET (0.00 sec) |
写一条sql语句,在t1中按照type分组,在每组中各取一个t1_id,找出它们对应的t2_id来,
个人比较习惯的还是使用in:
SELECT t2_id,t1_id FROM t2 WHERE t2.t1_id IN (SELECT t1_id FROM t1 GROUP BY TYPE); |
在mysql5.5中看一下:
mysql> SELECT version(); +------------+ | version() | +------------+ | 5.5.22-log | +------------+ 1 ROW IN SET (0.02 sec) mysql> SELECT t2_id,t1_id FROM t2 WHERE t2.t1_id IN (SELECT t1_id FROM t1 GROUP BY TYPE); +-------+-------+ | t2_id | t1_id | +-------+-------+ | 1 | 1 | | 2 | 1 | | 3 | 1 | | 4 | 1 | | 5 | 1 | | 6 | 1 | | 7 | 2 | | 8 | 2 | | 9 | 2 | | 10 | 2 | | 11 | 2 | | 12 | 2 | +-------+-------+ 12 ROWS IN SET (0.01 sec) mysql> EXPLAIN SELECT t2_id,t1_id FROM t2 WHERE t2.t1_id IN (SELECT t1_id FROM t1 GROUP BY TYPE); +----+--------------------+-------+-------+---------------+-------+---------+------+------+--------------------------+ | id | select_type | TABLE | TYPE | possible_keys | KEY | key_len | REF | ROWS | Extra | +----+--------------------+-------+-------+---------------+-------+---------+------+------+--------------------------+ | 1 | PRIMARY | t2 | INDEX | NULL | t1_id | 4 | NULL | 24 | USING WHERE; USING INDEX | | 2 | DEPENDENT SUBQUERY | t1 | INDEX | NULL | TYPE | 4 | NULL | 1 | USING INDEX | +----+--------------------+-------+-------+---------------+-------+---------+------+------+--------------------------+ 2 ROWS IN SET (0.00 sec) |
返回12条数据,用到了DEPENDENT SUBQUERY, 如果数据量大的话, 效率会非常低。
可以把此语句改写成inner join的形式。
据说mysql 5.6对于子查询做了相当不错的优化,
那就看看mysql5.6.9的表现:
mysql> SELECT version(); +--------------+ | version() | +--------------+ | 5.6.9-rc-log | +--------------+ 1 ROW IN SET (0.00 sec) mysql> SELECT t2_id,t1_id FROM t2 WHERE t2.t1_id IN (SELECT t1_id FROM t1 GROUP BY TYPE); +-------+-------+ | t2_id | t1_id | +-------+-------+ | 1 | 1 | | 2 | 1 | | 3 | 1 | | 4 | 1 | | 5 | 1 | | 6 | 1 | | 7 | 2 | | 8 | 2 | | 9 | 2 | | 10 | 2 | | 11 | 2 | | 12 | 2 | | 13 | 3 | | 14 | 3 | | 15 | 3 | | 16 | 3 | | 17 | 3 | | 18 | 3 | | 19 | 4 | | 20 | 4 | | 21 | 4 | | 22 | 4 | | 23 | 4 | | 24 | 4 | +-------+-------+ 24 ROWS IN SET (0.00 sec) mysql> EXPLAIN SELECT t2_id,t1_id FROM t2 WHERE t2.t1_id IN (SELECT t1_id FROM t1 GROUP BY TYPE); +----+-------------+-------+-------+---------------+-------+---------+---------------+------+-------------+ | id | select_type | TABLE | TYPE | possible_keys | KEY | key_len | REF | ROWS | Extra | +----+-------------+-------+-------+---------------+-------+---------+---------------+------+-------------+ | 1 | SIMPLE | t1 | INDEX | PRIMARY | TYPE | 4 | NULL | 4 | USING INDEX | | 1 | SIMPLE | t2 | REF | t1_id | t1_id | 4 | test.t1.t1_id | 1 | USING INDEX | +----+-------------+-------+-------+---------------+-------+---------+---------------+------+-------------+ 2 ROWS IN SET (0.00 sec) |
竟然返回了全部的24条记录……DEPENDENT SUBQUERY被优化成了SIMPLE, 效率肯定是提高了,可惜返回的结果集有问题…
首先你的subquery的用法不是标准的group by用法,而是mysql extensions to group by的语法,因为你的查询列没有出现在group by list里面,而且查询列t1_id并不能保证每个group里唯一。
然后是mysql 优化的问题,5.6将subquery优化成semi join的时候,如果子查询中仅有group by约束,没有having或者集函数时,则会直接把group by过滤掉(参见remove_redundant_subquery_clauses函数),因为optimizer以为你的查询列必然是group里面的列, 优化掉group by对查询列没有影响,所以这就导致了原始语句转化成:
select `t2`.`t2_id` AS `t2_id`,`t2`.`t1_id` AS `t1_id` from `t2` semi join (`t1`) where (1 and (`t2`.`t1_id` = `t1`.`t1_id`))
所以现在有两种途径解决这个问题:
1. 让你的group by语法是规范的语法,而不要用mysql扩展的语法,而且这种扩展的语法在其他数据库是直接报错的,迁移性也不好。
2. 在group by后面增加一个having子句,如having t1_id > -1,这种恒真的语法,让执行计划发生变化,当然这样优化器就利用不了semi join了,而是subquery类型了。
我查了下mysql bug系统,你好像没提交bug,我建了个新bug,用了你的测试用例。
http://bugs.mysql.com/bug.php?id=68254