Yet Again On Subqueries

…with a bit of strategic thinking

They come back, every now and then. Subqueries are far from being perfect at MySQL and they can give you some serious headaches.

Skilled MySQL developers know it better. They avoid subqueries as much as they can. It is not that subqueries do not work, it is just that the optimizer sometimes is, well, “not that optimised”.

So you may stay away from subqueries with some good SQL review. But what happens when the subquery is automatically generated by a script or a tool? If you can change the statement, I’m afraid you need to find some serious workarounds that vary case by case.

Here is an example that I found few weeks ago when I visited one of our customers.

Our customer used Magento for its site. Magento used a couple of queries that I will report here as sales and sales_items, although they are not their real name:

CREATE TABLE `sales` (

   `sales_id`     int(11) NOT NULL,

   `sales_entity` int(11) NOT NULL,

   `sales_status` int(11) NOT NULL,

   PRIMARY KEY (`sales_id`),

   KEY `ix2` (`sales_entity`)

) ENGINE=InnoDB DEFAULT CHARSET=latin1;

CREATE TABLE `sales_items` (

   `sales_item_id` int(11) NOT NULL,

   `product_id`    int(11) NOT NULL,

   `quantity`      int(11) NOT NULL,

   PRIMARY KEY (`sales_item_id`),

   KEY `ix2` (`product_id`)

) ENGINE=InnoDB DEFAULT CHARSET=latin1;

Both queries had tens of thousands of rows, but in our example I will use only few rows:

mysql> select * from sales;

+----------+--------------+--------------+

| sales_id | sales_entity | sales_status |

+----------+--------------+--------------+

|        1 |            1 |            1 |

|        2 |            1 |            2 |

|        3 |            2 |            1 |

|        4 |            2 |            2 |

|        5 |            3 |            1 |

|        6 |            3 |            2 |

|        7 |            4 |            1 |

|        8 |            4 |            2 |

|        9 |            5 |            1 |

|       10 |            5 |            2 |

+----------+--------------+--------------+

10 rows in set (0.00 sec)

mysql> select * from sales_items;

+---------------+------------+----------+

| sales_item_id | product_id | quantity |

+---------------+------------+----------+

|             1 |          1 |        1 |

|             2 |          1 |        1 |

|             3 |          2 |        1 |

|             4 |          2 |        1 |

|             5 |          3 |        1 |

|             6 |          3 |        1 |

+---------------+------------+----------+

6 rows in set (0.00 sec)

The application was running very slow. When we analysed the query log using SkySQL Enterprise Monitor, we found an UPDATE like this:

UPDATE sales SET sales_status = ?

 WHERE sales_entity IN (

       SELECT DISTINCT sales_item_id

         FROM sales_items

        WHERE product_id IN (?)
 )

The update was executed several times in a minute and it took several seconds to be executed. We already knew that the issue was in the subquery, but just to explain the situation to the customer we wanted to show an execution plan.

As you may know, you cannot use EXPLAIN with and UPDATE statement, therefore we use a SELECT that acts like the UPDATE. This SELECT did the trick:

SELECT * FROM sales

 WHERE sales_entity IN (
       SELECT DISTINCT sales_item_id

         FROM sales_items

        WHERE product_id IN (3)
);

+----------+--------------+--------------+

| sales_id | sales_entity | sales_status |

+----------+--------------+--------------+

|        9 |            5 |            1 |

|       10 |            5 |            2 |

+----------+--------------+--------------+

2 rows in set (0.00 sec)

The execution plan of the query is:

EXPLAIN SELECT * FROM sales

         WHERE sales_entity IN (

               SELECT DISTINCT sales_item_id
                 FROM sales_items

                WHERE product_id IN (3)
        );

+----+--------------------+-------------+-----------------+---------------+---------+---------+------+------+-------------+

| id | select_type        | table       | type            | possible_keys | key     | key_len | ref  | rows | Extra       |

+----+--------------------+-------------+-----------------+---------------+---------+---------+------+------+-------------+

|  1 | PRIMARY            | sales       | ALL             | NULL          | NULL    | NULL    | NULL |   10 | Using where |

|  2 | DEPENDENT SUBQUERY | sales_items | unique_subquery | PRIMARY,ix2   | PRIMARY | 4       | func |    1 | Using where |

+----+--------------------+-------------+-----------------+---------------+---------+---------+------+------+-------------+

2 rows in set (0.00 sec)

As you can see, the query uses the index in the subquery to retrieve the rows from sales_items, but then it does not use the index on sales_entity in sales. The result is that the UPDATE takes several seconds to be executed, because mysqld performs a full scan of the sales table.

If you want to check the details of this issue, you can have a look at the infamous bug 18826 – http://bugs.mysql.com/bug.php?id=18826

So what should you do to fix the problem? If you cannot change the statement, I am afraid there is not much you can do. We were luck and the customer could change the statement. By changing the select with a JOIN that replaces the subquery, something like:

SELECT sales.* FROM sales JOIN sales_items

    ON  (sales.sales_entity = sales_items.sales_item_id )

 WHERE sales_items.product_id in (3);

+----------+--------------+--------------+

| sales_id | sales_entity | sales_status |

+----------+--------------+--------------+

|        9 |            5 |            1 |

|       10 |            5 |            2 |

+----------+--------------+--------------+

2 rows in set (0.00 sec)

The SELECT gives the same result of the previous SELECT and the EXPLAIN command shows:

EXPLAIN SELECT sales.* FROM sales JOIN sales_items

            ON  (sales.sales_entity = sales_items.sales_item_id )

         WHERE sales_items.product_id in (3);

+----+-------------+-------------+------+---------------+------+---------+--------------------------------+------+-------------+

| id | select_type | table       | type | possible_keys | key  | key_len | ref                            | rows | Extra       |

+----+-------------+-------------+------+---------------+------+---------+--------------------------------+------+-------------+

|  1 | SIMPLE      | sales_items | ref  | PRIMARY,ix2   | ix2  | 4       | const                          |    2 | Using index |

|  1 | SIMPLE      | sales       | ref  | ix2           | ix2  | 4       | test.sales_items.sales_item_id |    1 |             |

+----+-------------+-------------+------+---------------+------+---------+--------------------------------+------+-------------+

2 rows in set (0.00 sec)

So ix2 can be used to filter the rows in the SELECT, and ultimately in the UPDATE.

In fact, the new UPDATE is:

UPDATE sales JOIN sales_items
    ON ( sales.sales_entity = sales_items.sales_item_id )

   SET sales_status = ?

 WHERE sales_items.product_id IN ( ? )

The server tales few milliseconds to execute the query, fixing also the headache of our developers better than a pain killer.

Some people may argue that the use of subqueries is not efficient and developers should learn some good SQL. Unfortunately, this is not the point. Products like Magento and others – I have primarily in mind BI and reporting tools – are supposed to be database agnostic and generate queries for many RDBMSs with minor customisation. Bugs like this give the end users the impression that MySQL is terribly slow and is simply not good enough for the job, so they should refer to expensive commercial RDBMSs – the ones that have subqueries working properly.

I really, really hope that the optimiser team at Monty Program one day will fix this bug. We will see even more adoption of MySQL in the Enterprise world, something that has been always underestimated. More projects in the “Enterprise world” mean more resources injected into the MySQL ecosystem, resources (funds, skilled developers and DBAs, expertise) that may be used to improve the product and the solutions around it.

Written by Ivan Zoratti

25 January 2011 at 9:39 pm

Posted in MySQL, MySQL Server, Subqueries, Uncategorized

10 Responses

Subscribe to comments with RSS.

In the Google patch, the parser was changed to reject subqueries and views — problem avoided

Mark Callaghan

25 January 2011 at 11:39 pm

Reply
- Avoided? More like ‘Magento[1] does not run at all on Google patched MySQL’
  
  [1] Put any other software that uses subqueries extensively.
  
  Mchl
  
  26 January 2011 at 11:52 am
  
  Reply
Use LEFT JOIN is wrong. Remove the LEFT keyword

home

26 January 2011 at 12:32 am

Reply
- You are right! I took the examples from the application literally, but for this example I simplified the query by removing the left join. I updated the post and replaced the LEFT JOIN with the INNER JOIN.
  
  Ivan Zoratti
  
  29 January 2011 at 12:29 am
  
  Reply
Ivan,

It is not clear to me why you use a LEFT JOIN instead of an ordinary inner JOIN.

Øystein Grøvlen

26 January 2011 at 9:23 am

Reply
- Hi Oystein,
  The LEFT JOIN in this simple example is an error – I updated the post.
  -ivan
  
  Ivan Zoratti
  
  29 January 2011 at 12:30 am
  
  Reply
Why are you using a LEFT JOIN in the UPDATE statement here? An inner join would have the same result in this case. Plus there’s the danger with the LEFT JOIN that if you removing the limiting statement in the WHERE clause, all rows in the left-side-table are changed.

Anon

26 January 2011 at 11:56 pm

Reply
- Yes, that was an error, I have fixed the post.
  
  Thanks!
  -ivan
  
  Ivan Zoratti
  
  29 January 2011 at 12:30 am
  
  Reply
I wonder if it would do (more or less) the same trick to replace the “IN” operator with the “=” operator, ie.

UPDATE sales SET sales_status = ?
WHERE sales_entity = (
SELECT DISTINCT sales_item_id
FROM sales_items
WHERE product_id IN (?)

In other words the question is: does the “inner query” ever return more than a single row?

Peter Laursen

29 January 2011 at 11:01 pm

Reply
I have an old report here actually http://bugs.mysql.com/bug.php?id=39704 that seems to indicate that indexes are used when the “IN” operator is replaced with the “=” operator.

At that time I forgot to see EXPLAIN EXTENDED+SHOW WARNINGS, but my guess is the the optimizer can handle to rewrite the statement with “=” to a reasonable JOIN.

(but this of course requires that you can be certain that a single row only is returned for the “inner query”. Knowledge of the application should tell this.)

Peter Laursen

29 January 2011 at 11:50 pm

Reply