Subqueries in MySQL

A subquery is a type of query that is embedded—or nested—into a data manipulation language (DML) statement. The data returned by the subquery is passed into the DML statement and incorporated into its overall logic. The subquery itself is typically a SELECT statement, although you can also use a TABLE statement or VALUES statement. Even so, the SELECT statement continues to be the most common choice for subqueries, and that’s the one I focus on in this article.

This article is part of Robert Sheldon's continuing series on Learning MySQL. To see all of the items in the series, click here.

With that in mind, let’s dive into the subquery and take a look at several different ones in action. In this article, I focus primarily on how the subquery is used in SELECT statements to retrieve data in different ways from one table or from multiple tables. Like the previous articles in this series, this one is meant to introduce you to the basic concepts of working with subqueries so you have a solid foundation on which to build your skills. Also like the previous articles, it includes a number of examples to help you better understand how to work with subqueries so you can start using them in your DML statements.

Preparing your MySQL environment

For the examples in this article, I used the same database and tables that I used for the previous article. The database is named travel and it includes two tables: manufacturers and airplanes. However, the sample data I use for this article is different from the last article, so I recommend that you once again rebuild the database and tables to keep things simple for this article’s examples. You can set up the database by running the following script:

After you’ve created the database, you can add the sample data so you can follow along with the exercises in this article. Start by running the following INSERT statement to add data to the manufacturers table:

The statement adds nine rows to the manufacturers table, which you can confirm by querying the table. The manufacturer_id value for the first row should be 1001. After you confirm the data in the manufacturers table, you can run the following INSERT statement to populate the airplanes table, using the manufacturer_id values from the manufacturers table:

The manufacturer_id values from the manufacturers table provide the foreign key values needed for the manufacturer_id column in the airplanes table. After you run the second INSERT statement, you can query the airplanes table to confirm that 35 rows have been added. The first row should have been assigned 101 for the plane_id value, and the plane_id values for the other rows should have been incremented accordingly.

Building a basic scalar subquery

A scalar subquery is one that returns only a single value, which is then passed into the outer query through one of its clauses. The subquery is used in place of other possible expressions, such as a constants or column names. For example, the following SELECT statement (the outer query) includes a subquery in search condition of the WHERE clause:

The subquery is the expression on the right side of the equal sign, enclosed in parentheses. A subquery must always be enclosed in parentheses, no matter where it’s used in the outer statement.

Make certain that your subquery does indeed return only one value, if that’s what it’s supposed to do. If the subquery were to return multiple values and your WHERE clause is not set up to handle them (as in this example), MySQL will return an error letting you know that you messed up.

In this case, the subquery is a simple SELECT statement that returns the manufacturer_id value for the manufacturer named Beechcraft. This value, 1003, is then passed into the WHERE clause as part of its search condition. If a row in the airplanes table contains a manufacturer_id value that matches 1003, the row is included in the query results, which are shown in the following figure.

The subquery in this example retrieves data from a second table, in this case, manufacturers. However, a subquery can also retrieve data from the same table, which can be useful if the data must be handled in different ways. For example, the subquery in the following SELECT statement retrieves the average max_weight value from the airplanes table:

The WHERE clause search condition in the outer statement uses the average to return only rows with a max_weight value greater than that average. If you were to run the subquery on its own, you would see that the average maximum weight is 227,499 pounds. As a result, the outer SELECT statement returns only those rows with a max_weight value that exceeds 227,499 pounds, as shown in the following figure.

As I mentioned earlier in the article, you can use subqueries in statements other than SELECT. One of those statements is the SET statement, which lets you assign variable values. You can use a SET statement to define a value that can then be passed into other statements. For example, the following SET and SELECT statements implement the same logic as the previous example and return the same results:

The SET statement defines a variable named @avg_weight and uses a subquery to assign a value to that variable. This is the same subquery that is in the previous example. The SELECT statement then uses that variable in its WHERE clause (in place of the original subquery) to return only those rows with a max_weight value greater than 227,499 pounds.

The examples in this section focused on using scalar subqueries in SELECT statements, but be aware that you can also use them in UPDATE and DELETE statements, as well as the SET clause of the UPDATE.

Working with correlated subqueries

One of the most valuable features of a subquery is its ability to reference the outer query from within the subquery. Referred to as a correlated subquery, this type of subquery can return data that is specific to the current row being evaluated by the outer statement. For example, the following SELECT statement uses a correlated subquery to calculate the average weight of the planes for each manufacturer, rather than for all planes:

Unlike the previous two examples, the subquery now includes a WHERE clause that limits the returned rows to those with a manufacturer_id value that matches the current manufacturer_id value in the outer query. This is accomplished by assigning an alias (a) to the table in the outer query and using that alias when referencing the table’s manufacturer_id column within the subquery.

In this case, I also assigned an alias (a2) to the table referenced within the subquery, but strictly speaking, you do not need to do this. I like to include an alias for consistency and code readability, but certainly take whatever approach works for you.

To better understand how the subquery works logically (as opposed to how the optimizer might actually execute it), consider the first row in the airplanes table, which has a manufacturer_id value of 1001.

When the outer query evaluates the first row, it compares the max_weight value to the value returned by the subquery. To carry out this comparison, the database engine first matches the manufacturer_id value in the outer query to the manufacturer_id values returned by the subquery’s SELECT statement. It then finds all rows associated with the current manufacturer and returns the average max_weight value for that manufacturer, repeating the process for each manufacturer returned by the outer query.

The following figure shows the results returned by the outer SELECT statement. The max_weight values are now compared only with the manufacturer-specific averages.

You can also use this same subquery in the SELECT clause as one of the column expressions. In the following example, I added the subquery after the max_weight column and assigned the alias avg_weight to the new column:

The new subquery works the same as in the preceding example. It returns the average weight only for the planes from the current manufacturer, as shown in the following figure.

In the previous example, I used a subquery to create a generated column. However, you can use a subquery when creating an even more complex generated column. In the following example, I’ve added a generated column named amt_over, which subtracts the average weight returned by the subquery from the weight in the max_weight column:

The avg_weight column in the SELECT list is a generated column that uses a subquery to return the average weight of the current manufacturer. The amt_over column is also a generated column and it uses the same subquery to return the average weight. Only this time, the column subtracts that average from the max_weight column to return the amount that exceeds the average.

The following figure shows the results now returned by the outer SELECT statement. As you can see, they include the amt_over generated column, which shows the differences in the weights.

If you want, you can also retrieve the name of the manufacturer from the manufacturers table and include that in your SELECT clause, as shown in the following example:

This statement is similar to the previous statement except that it adds the manufacturer column, a generated column. The new column uses a subquery to retrieve the name of the manufacturer from the manufacturers table, based on the manufacturer_id value returned by the outer query.

In some cases, a subquery might not perform as well as other types of constructs. For example, MySQL can often optimize a left outer join better than a subquery that carries out comparable logic. If you’re using a subquery to perform an operation that can be achieved in another way and are concerned about performance, you should consider testing both options under a realistic workload to determine which is the best approach.

The following figure shows the results with the additional column, which I’ve named manufacturer.

One other detail I want to point out about using subqueries in the SELECT list is that you can include them even if you’re grouping and aggregating data. For instance, the following SELECT statement includes a subquery that retrieves the name of the manufacturer associated with each group:

The subquery in this statement is similar to those you’ve seen in other examples, except that now you’re dealing with aggregated data, so the subquery must use a column that is mentioned in the GROUP BY clause of the outer statement, which it does (the manufacturer_id column). The statement returns the results shown in the following figure.

As you can see, the data has been grouped based on the manufacturer_id column, and the name of the manufacturer is included with each ID. In addition, the number of airplanes with piston engines is provided for each manufacturer, with the results sorted in descending order, based on that amount.

Working with a row of data

So far, all the examples in this article have returned scalar values, but your subqueries can also return multiple values, as noted earlier. For example, it might be useful to use a subquery to aggregate a table’s data, calculate specific averages in that data (returned as a single row), and then retrieve rows from the same table that exceed those averages, which is what I’ve done in the following SELECT statement:

Notice that the WHERE clause in the outer statement includes the max_weight and parking_area columns in parentheses. In MySQL, this is how you create what is called a row constructor, a structure that supports simultaneous comparisons of multiple values. In this case, the row constructor is compared to the results from the subquery, which returns two corresponding values. The first value is the average weight, as you saw earlier. The second value returns the average parking area, which is based on the values in the parking_area column.

In both cases, the averages are specific to the current manufacturer_id value in the outer query. For the outer query to return a row, the two values in the row constructor must be greater than both corresponding values returned by the subquery. Notice also that the SELECT list now includes both the max_weight and parking_area columns, along with the average for each one. The outer statement returns the results shown in the following figure.

One thing you might have noticed about the subqueries in the preceding examples, particularly those used in the WHERE clauses, is that all the search conditions in those clauses use basic comparison operators, either equal (=) or greater than (>). However, you can use any of the other comparison operators, including special operators such as IN and EXISTS, as you’ll see in the next section.

Working with a column of data

The previous section covered row subqueries. A row subquery can return only a single row, although that row can include one or more columns. This is in contrast to a scalar subquery , which returns only a single row and single column. In this section, we’ll look at the column subquery, which returns only a one column with one or more rows.

As with row subqueries, column subqueries are often used in the WHERE clause when building your search conditions. Also like row subqueries, your search condition must take into account that the subquery is returning more than one value.

For example, the WHERE clause in the following SELECT statement uses the IN operator to compare current the manufacturer_id value in the outer statement with the list of manufacturer_id values returned by the subquery:

The column data returned by the subquery includes only those manufacturers that offer planes with piston engines (as reflected in the current data set). The IN operator determines whether the current manufacturer_id value is included in that list. If it is, that row is returned, as shown in the following figure.

As with many operations in MySQL, you can take different approaches to achieve the same results. For example, you can replace the IN operator with an equal comparison operator, followed by the ANY keyword, as shown in the following example:

You could have instead used another comparison operator, such as greater than (>) or lesser than (<) or even ALL instead of ANY. The statement returns the same results as the previous example. In fact, you can also achieve the same results by rewriting the entire statement as an inner join:

I’m not going to go into joins here, but as I mentioned earlier, joins can sometimes provide performance benefits over subqueries, so you should be familiar with how they work and how they differ from subqueries (a topic that could easily warrant its own article).

With that in mind, be aware that you can also use the NOT keyword with some operators to return different results. For example, the WHERE clause in the following SELECT statement uses the NOT IN operator ensure the current manufacturer_id value is not in the list of values returned by the subquery:

In this case, the subquery returns a distinct list of all manufacturer_id values in the airplanes table. The list is then compared to each manufacturer_id value in the manufacturers table, as specified by the outer query. If the value is not in the list, the search condition evaluates to true and the row is returned. In this way, you can determine which manufacturers are in the manufacturers table but are not in the airplanes table. The query results are shown in the following figure.

Be very careful when using NOT IN with your subqueries. If the subquery returns a NULL value, your WHERE expression will never evaluate to true.

Another operator you can use in the WHERE clause is EXISTS (and its counterpart NOT EXISTS). The EXISTS operator simply checks whether the subquery returns any rows. If it does, the search condition evaluates to true, otherwise it evaluates to false. For example, the following SELECT statement defines similar logic as the preceding example, except that it checks for which manufacturers are included in both tables:

Notice that you need only specify the EXISTS operator followed by the subquery. If the subquery returns a row for the current manufacturer_id value, the search condition evaluates true and the outer query returns a row for that manufacturer. The following figure shows the results returned by the statement.

When working with column subqueries, it’s important to understand how to use comparison operators such as IN, NOT IN, ANY, ALL, and EXISTS. If you’re not familiar with them, be sure to refer to the MySQL documentation to learn more.

Using subqueries in the FROM clause

In addition to rows, columns, and scalar values, subqueries can also return tables, which are referred to as derived tables. Like other subqueries, a table subquery must be enclosed in parentheses. In addition, it must also be assigned an alias, similar to specifying a table alias when building correlated subqueries. For example, the following SELECT statement includes a table subquery named total_planes, which is included in the FROM clause of the outer statement:

Notice that the outer statement does not specify a table other than the derived table returned by the subquery. The subquery itself groups the data in the airplanes table by the manufacturer_id values and then returns the ID and total number of planes in each group. The outer statement then finds the average number of planes across all groups. In this case, the statement returns a value of 5.00.

Now let’s look at another example of a table subquery. Although this next example is similar to the previous one in several ways, it includes something you have not seen yet, one subquery nested within another. In this case, I’ve nested a table subquery within another table subquery to group data based on custom categories and then find the average across those groups:

The innermost subquery—the one with the CASE expression—assigns one of five category values (A, B, C, D, E, and F) to each range of parking area values. The subquery returns a derived table named plane_size, which contains a single column named category. The column contains a category value for each plane in the airplanes table.

The data from the plane_size table is then passed to the outer subquery. This subquery groups the plane_size data based on the category values and generates a second column named amount, which provides the total number of planes in each category. The outer subquery returns a derived table named plane_cnt. The outer statement then finds the average number of planes across all groups in the derived table, returning a value of 5.83.

Working with MySQL subqueries

Like many aspects of MySQL, the topic of subqueries is a much broader than what can be covered in a single article. To help you complete the picture, I recommend that you also check out the MySQL documentation on subqueries, which covers all aspects of how to use subqueries. In the meantime, you should have learned enough here to get a sense of how subqueries work and some of the ways you can use them in your SQL statements. Once you have a good foundation, you can start building more complex subqueries and use them in statements other than SELECT queries. Just be sure to keep performance in mind and consider alternative statement strategies, when necessary, especially if working with larger data sets.