In the second installment of this series we looked at individual access paths for the tables in a simple join query to highlight an important flaw in the default model that the optimizer uses for indexes. Having taken advantage of a recent enhancement that addresses that flaw we are now ready to move onto the problems that appear with the query when taken as a whole.

Reprise

Our query joins orders to order lines, and extracts data about orders placed in a given date range with order lines for a given product:

select
        trunc(ord.date_ordered),
        count(*),
        sum(orl.quantity),
        sum(orl.quantity * orl.unit_price) total_value
from
        orders          ord,    
        order_lines     orl
where
        ord.date_ordered >= trunc(sysdate) – 7
and     ord.date_ordered <  trunc(sysdate)
and     orl.order_id     =  ord.order_id
and     orl.product_id   =  101234
group by
        trunc(ord.date_ordered)
order by
        total_value desc
;

We have worked out that we have about 20,000 orders in the week (though the optimizer has estimated 11,500) that are very well packed near the end of the orders table, and about 1,000 order lines for the requested product (and the optimizer agrees with our estimate) scattered widely through the order_lines table.

By setting the table_cached_blocks table preference to an appropriate value (on both tables though, for our purposes, the orders table was the critical one) we have allowed the optimizer to realize that the orders are well clustered so it is happy to use the (date_ordered) index to select the orders for the week; and even without adjustment the optimizer was happy to use the (product_id) index on the order_lines table to select all the order lines for the given product.

The best laid plans…

So what happens when we try to join the tables and combine the two predicates? Here’s the default plan on my system after (once again) setting the table_cached_blocks preference to a suitable value and collecting stats:

------------------------------------------------------------------------------------------------
| Id  | Operation                       | Name         | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                |              |  1083 | 33573 |  1775   (1)| 00:00:09 |
|   1 |  SORT ORDER BY                  |              |  1083 | 33573 |  1775   (1)| 00:00:09 |
|   2 |   HASH GROUP BY                 |              |  1083 | 33573 |  1775   (1)| 00:00:09 |
|*  3 |    FILTER                       |              |       |       |            |          |
|*  4 |     HASH JOIN                   |              |  1083 | 33573 |  1773   (1)| 00:00:09 |
|   5 |      TABLE ACCESS BY INDEX ROWID| ORDER_LINES  |  1083 | 18411 |  1064   (1)| 00:00:06 |
|*  6 |       INDEX RANGE SCAN          | ORL_FK_PRD   |  1083 |       |     4   (0)| 00:00:01 |
|   7 |      TABLE ACCESS BY INDEX ROWID| ORDERS       | 11525 |   157K|   709   (1)| 00:00:04 
|*  8 |       INDEX RANGE SCAN          | ORD_DATE_ORD | 11525 |       |    34   (3)| 00:00:01 |
------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - filter(TRUNC(SYSDATE@!)>TRUNC(SYSDATE@!)-7)
   4 - access("ORL"."ORDER_ID"="ORD"."ORDER_ID")
   6 - access("ORL"."PRODUCT_ID"=101234)
   8 - access("ORD"."DATE_ORDERED">=TRUNC(SYSDATE@!)-7 AND
              "ORD"."DATE_ORDERED"<TRUNC(SYSDATE@!))

I’m not going to try and explain every detail of the execution plans we’re going to look at but, as you can see in operations 4 through to 8, the optimizer has basically decided to run the two simpler queries I showed you in the previous installment and then joined the two results using a hash join.

To do a hash join, Oracle acquires the whole of the first data set (known as the “build” table) and builds an in-memory hash table for it based on generating hash values for the columns used in equality join predicates; then it accesses the second data set (known as the “probe” table) one row at a time, applies the same hashing function to the equality join columns, and probes the in-memory hash table to see if there are any possible matches. If there is a matching hash value Oracle then checks the actual values of all the join predicates to see if it is a proper match or simply a “hash collision”.

General Note: in an execution plan the first child (operation 5 in the above) to a hash join operation always identifies the build data set, the second child (operation 7 in the above) always identifies the probe data set.

I’ve said that the hash table will be in-memory, but if the hash table is too big to fit in the available memory Oracle has a mechanism for saving it to disc in batches (called partitions) and saving the probe table to disc in matching batches, then doing the join piecewise between matching batches. (See Cost Based Oracle – Fundamentals, Apress 2005, if you want a more detailed description).

In our case Oracle has decided to use the order_lines data as the build table because it thinks that the total volume of data it will fetch will be only 18KB compared to 157KB for the orders table. (Note: the build table is chosen based on the number of bytes acquired, not the number of rows).

From the arithmetic we can see that the optimizer assumes it will be able to keep that 18KB in memory: the cost of acquiring the data from order_lines is 1,064; the cost of acquiring the data from orders is 709, the total cost of the hash join is 1,773 – which is exactly 1064 + 709: the optimizer thinks that (to the nearest unit) the cost of doing the actual hash join is virtually free, i.e.: it’s going to take a little CPU and a little memory, but no extra disc I/O.

Alternatives:

In the first installment we consider two other possible plans – a nested loop join starting with orders, or a nested loop join starting with order_lines. We can get these plans through some fairly simple hinting:

  • Nested loop starting with order_lines: /*+ leading(orl ord) use_nl(ord) index(ord(order_id)) */
  • Nested loop starting with orders: /*+ leading(ord orl) use_nl(orl) index(orl(order_id)) */

These hints tell the optimizer the single join order it is allowed to consider, tells it the join mechanism it must use to join to the table identified, and tells it to use an index that starts with a specific column when it joins to that table. Here’s the resulting plan (minus predicate section) for the first of the two sets of hints:

------------------------------------------------------------------------------------------------
| Id  | Operation                        | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                 |             |  1083 | 33573 |  3232   (1)| 00:00:17 |
|   1 |  SORT ORDER BY                   |             |  1083 | 33573 |  3232   (1)| 00:00:17 |
|   2 |   HASH GROUP BY                  |             |  1083 | 33573 |  3232   (1)| 00:00:17 |
|*  3 |    FILTER                        |             |       |       |            |          |
|   4 |     NESTED LOOPS                 |             |  1083 | 33573 |  3230   (1)| 00:00:17 |
|   5 |      NESTED LOOPS                |             |  1083 | 33573 |  3230   (1)| 00:00:17 |
|   6 |       TABLE ACCESS BY INDEX ROWID| ORDER_LINES |  1083 | 18411 |  1064   (1)| 00:00:06 |
|*  7 |        INDEX RANGE SCAN          | ORL_FK_PRD  |  1083 |       |     4   (0)| 00:00:01 |
|*  8 |       INDEX UNIQUE SCAN          | ORD_PK      |     1 |       |     1   (0)| 00:00:01 |
|*  9 |      TABLE ACCESS BY INDEX ROWID | ORDERS      |     1 |    14 |     2   (0)| 00:00:01 |
------------------------------------------------------------------------------------------------

An overriding feature here is that the cost of the plan is higher than the cost of the hash join; so it’s not going to be taken by default.

There is a little glitch in the figures produced by this execution plan, so bear with me while I explain the arithmetic. The join mechanism shown is one of the newer nested loop mechanisms (known as “NLJ Batching”) that Oracle can use, and it ends up being reported as two nested loop operations, the first to access the index the second to access the table.

Unfortunately the Rows, Bytes, Cost and Time figures shown against the first nested loop (operation 5) report the figures that are due to the completion of the second nested loop (operation 4). Ideally the figures at operation 5 should read more like:

------------------------------------------------------------------------------------------------
| Id  | Operation                        | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
------------------------------------------------------------------------------------------------
|   5 |      NESTED LOOPS                |             |  1083 | 24909 |  2147   (1)| 00:00:11 |
------------------------------------------------------------------------------------------------

So we’re going to ignore that line and go straight to the figures at operation 4 (optional exercise: after reading the next couple of paragraphs, check that this theoretical line makes sense).

Essentially the plan is telling us that for EACH ROW we receive at operation 6 from the table access to order_lines we’re going to probe the ord_pk index (which the optimizer estimates will require one physical read – hence the 1 in the cost column of operation 8 – and then visit the orders table for a second physical read – hence the 2 in the cost column of operation 9 (and technically we could argue that for complete consistency that really should be a 1, the 2 comes from the same historic hang-over that has allowed operation 5 to display misleading figures).

With this description in hand we can see where the total cost of the nested loop comes from: it’s the cost of getting the data from order_lines plus the cost of getting a row from the orders table 1,083 times, i.e.: 1064 + 1083 * 2 = 3,230. Let’s flush the buffer cache and run the query to check how sensible this model is. Here’s the execution plan pulled from memory after running with rowsource_execution_statistics enabled (though I’ve re-arranged the column order and deleted the memory-related information):

--------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                        | Name        | Starts | E-Rows |  A-Rows |   A-Time   | Buffers | Reads  |Cost (%CPU)|
--------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                 |             |      1 |        |       7 |00:00:00.76 |    4829 |   3189 | 3232 (100)|
|   1 |  SORT ORDER BY                   |             |      1 |   1083 |       7 |00:00:00.76 |    4829 |   3189 | 3232   (1)|
|   2 |   HASH GROUP BY                  |             |      1 |   1083 |       7 |00:00:00.76 |    4829 |   3189 | 3232   (1)|
|*  3 |    FILTER                        |             |      1 |        |      17 |00:00:00.76 |    4829 |   3189 |           |
|   4 |     NESTED LOOPS                 |             |      1 |   1083 |      17 |00:00:00.76 |    4829 |   3189 | 3230   (1)|
|   5 |      NESTED LOOPS                |             |      1 |   1083 |    1215 |00:00:00.37 |    3623 |   2008 | 3230   (1)|
|   6 |       TABLE ACCESS BY INDEX ROWID| ORDER_LINES |      1 |   1083 |    1215 |00:00:00.10 |    1191 |   1191 | 1064   (1)|
|*  7 |        INDEX RANGE SCAN          | ORL_FK_PRD  |      1 |   1083 |    1215 |00:00:00.01 |       5 |      5 |    4   (0)|
|*  8 |       INDEX UNIQUE SCAN          | ORD_PK      |   1215 |      1 |    1215 |00:00:00.26 |    2432 |    817 |    1   (0)|
|*  9 |      TABLE ACCESS BY INDEX ROWID | ORDERS      |   1215 |      1 |      17 |00:00:00.38 |    1206 |   1181 |    2   (0)|
--------------------------------------------------------------------------------------------------------------------------------

As we saw before, the optimizer’s estimate of the number of order_lines was close to, but didn’t match, our superior knowledge. We then see that for each (A-)row at operation 6 we have started operations 8 and 9, finding one row each time on the index probe, but only finding a suitable row occasionally (the few rows for the last week) on visiting the table. Note that – allowing for rounding errors starts * E-rows = A-rows, the significant rounding error being that the optimizer’s “real (unrounded)” estimate of E-rows for the table access was far less than one per visit.

The most significant feature of the plan is the close correspondence between the Reads and the Cost. Everything we do is single block reads so we don’t have to do any “multi-block” adjustment to get from the read count to the optimizer’s cost, there should be a one-to-one match. This is (reasonably accurately) what we see.

First we read 1191 blocks to get the 1215 order_lines A-rows; that’s fairly close to Oracle’s estimate of 1,064 reads to get a predicted 1,083 rows.

For each of those order_lines we access the ord_pk index – the optimizer predicted a single read each time, and we see a total of 817 reads, so some (though not most) of the index leaf blocks got into the cache and were re-used as the query ran.

For each index access we then visited the orders table, again we made 1,215 visits and Oracle reports a total of 1,181 physical reads for this step – we got hardly any benefit from revisiting blocks in the cache here. These two steps demonstrate a fairly general principle, by the way: when doing lots of access to randomly scattered data the index probes tended to get a lot more caching benefit than the table visits … except in special cases.

Notice how the Reads column corroborates my comments about the “correct” values that should appear in the cost, rows, and bytes columns of operation 5. The reads for operation 5 are 1,191 + 817 = 2,008 (compared to my hypothetical 2,147); then the reads in operation 4 come to 2,008 + 1,181 = 3,189 closely matching the predicted cost of 3,230. Although the nested loop is following a new algorithm, the code to supply the numbers to the execution plan is still following the old (8i and earlier) nested loop model.

A key performance feature of this plan (which also makes clearer a flaw in the hash join plan) is that we have to do a lot of random physical reads to acquire data we don’t need. After doing 1,064 physical reads to collect order_lines (a workload done my both plans so far), this plan did a further 1,998 physical reads to find 1,215 rows, then discarded all but 17 of them. That’s a lot of wasted work.

Trade-offs

If the calculations that appear in the first two plans are wonderfully accurate, what about the third? Here’s the plan (again with some cosmetic re-arrangement) after running the query with rowsource execution s1tatistics enabled:

---------------------------------------------------------------------------------------------------------------------------------
| Id  | Operation                        | Name         | Starts | E-Rows | A-Rows |   A-Time   | Buffers | Reads  | Cost (%CPU)|
---------------------------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                 |              |      1 |        |      7 |00:00:01.30 |   38533 |   1201 | 35339 (100)|
|   1 |  SORT ORDER BY                   |              |      1 |   1083 |      7 |00:00:01.30 |   38533 |   1201 | 35339   (1)|
|   2 |   HASH GROUP BY                  |              |      1 |   1083 |      7 |00:00:01.30 |   38533 |   1201 | 35339   (1)|
|*  3 |    FILTER                        |              |      1 |        |     17 |00:00:01.30 |   38533 |   1201 |            |
|   4 |     NESTED LOOPS                 |              |      1 |   1083 |     17 |00:00:01.30 |   38533 |   1201 | 35337   (1)|
|   5 |      NESTED LOOPS                |              |      1 |  11525 |  23628 |00:00:00.83 |   19182 |    635 | 35337   (1)|
|   6 |       TABLE ACCESS BY INDEX ROWID| ORDERS       |      1 |  11525 |  19667 |00:00:00.44 |   18425 |    570 |   709   (1)|
|*  7 |        INDEX RANGE SCAN          | ORD_DATE_ORD |      1 |  11525 |  19667 |00:00:00.07 |      55 |     55 |    34   (3)|
|*  8 |       INDEX RANGE SCAN           | ORL_PK       |  19667 |      1 |  23628 |00:00:00.17 |     757 |     65 |     2   (0)|
|*  9 |      TABLE ACCESS BY INDEX ROWID | ORDER_LINES  |  23628 |      1 |     17 |00:00:00.32 |   19351 |    566 |     3   (0)|
---------------------------------------------------------------------------------------------------------------------------------

If we start by examing the cost we can see the basic nested loop algorithm at work. Operations 6 and 7 are predicted to supply 11,525 rows at a cost of 709; in fact we got 19,667 rows by reading 570 blocks – so a reasonably good estimate from the optimizer. The table access with index range scan did require 18,425 buffer visits, though, which translates into CPU time and potential latch contention.

For each row we do an index (range) probe into the primary key index on the order_lines table, then visit the order_lines table typically once but occasionally a few more times for a total of 23,628 visits discarding all but 17 rows from the table.

Because the order_lines data is very well clustered by date (and order_id) we get a terrific caching benefit as this query is running so we read just 65 index blocks and 566 table blocks to get the data – for a total read requirement of 1,201 physical blocks read.

Looking at the optimizer’s costing, though: it has allowed 2 physical reads for each probe of the order_lines index, and one more read for the table visit. This would be a reasonable estimate for a single order, of course, but the optimizer has predicted 11,525 orders and simply multiplied the estimated cost by the number of rows for a total incremental cost 34,575 – giving a total cost of 34,575 + 709 = 35,284 (with a small rounding error introducing a reported cost estimate of 35,337).

Critically the optimizer has not allowed for the massive “self-caching” benefit that has to appear as the query runs. The worst case scenario here is that we have to read all the blocks for orders placed in the last 7 days and all the blocks for order_lines created in the last 7 days – and both sets of data are well packed for a total of 1,201 blocks. Add to that the fact that the orders placed in the last 7 days are likely to be fairly well cached before we run the query (since order processing, packing, delivery, invoicing, payment receipt etc. all tend to happen over the course of a few days following order placement) and we have a massive overestimate of the cost of this query because Oracle doesn’t understand the order processing business and has a simplistic nested loop model that takes a reasonable cost for ‘one random event’ and multiplies it up by ‘number of driving rows’ without realizing that the nature of the selection from the driving table can eliminate a huge degree of randomness.

Summary

For comparative purposes, here are the runtime activity summaries:

  • Hash join (not shown): 1,191 + 570 = 1,761 physical reads; 19,600 buffer gets; cost = 1,775
  • NLJ from order_lines: 1,191 + 1,998 = 3,189 physical reads; 4,829 buffer gets; cost = 3,232
  • NLJ from orders: 570 + 631 = 1,201 physical reads; 38,533 buffer gets; cost = 35,339

Ironically the costing is a good model for two of the three plans, while the third plan with a terrible cost estimate is likely to be the best plan.

The cost of a typical hash join (when it is expected to complete in memory) is:

Cost of acquiring first data set + cost of acquiring second data set + a little bit

The cost of a typical nested loop join is:

Cost of acquiring first data set  + (rows in first data set * cost of acquiring one related set of items from the second data set)

Because Oracle cannot be fully informed about the prior caching of data and the self-caching that goes on as a query progresses it is very easy for Oracle to overestimate the total amount of physical I/O that will take place as a nested loop join executes. This is a defect that cannot easily be overcome, except through hinting (or, to do it in the approved manner, supplying an SQL Baseline or Outline).

Footnote

If anyone is thinking at this point of fiddling with the parameters optimizer_index_caching (what percentage of ALL indexes should I assume to be cached) and optimizer_index_cost_adj (effectively, for values between 0 and 100, what percentage of ALL tables will NOT be cached when accesses through an index), then think again. When I set optimizer_index_caching = 95 and optimizer_index_cost_adj = 5 the default path was still the hash join; even when I set the parameters to 99 and 1 respectively the hash join still had the lowest cost.

–> Catalogue of current articles in CBO series.