In the previous instalment of this series I introduced three ways of accessing the run-time statistics for a query and described, for one of the methods, the basics of the information we can get and how we can use it. In this article I want to expand on the use of one method to show how it can help us identify and resolve performance problems. The method is known (after Wolfgang Breitling’s exposition of it some 10 or more years ago) as Cardinality Feedback.

Skewed Data

I have the following query – which I may not have analyzed properly before putting it into production:

select
	t1.id,
	t2.small_vc
from
	t1,
	t2
where
	t1.date_ord >= trunc(sysdate) - 14
and	t1.supp_id = 1
and	t2.id = t1.id
order by
	t1.id
;

When I enabled rowsource execution statistics by setting statistics_level to all (which increased the run time from 0.02 seconds to 0.25 seconds) I got the following information from the in-memory execution plan:

select * from table(dbms_xplan.display_cursor(null,null,'iostats last +cost'));

-----------------------------------------------------------------------------------------------------------------
| Id  | Operation                      | Name    | Starts | E-Rows | Cost (%CPU)| A-Rows |   A-Time   | Buffers |
-----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT               |         |      1 |        |   206 (100)|   2800 |00:00:00.23 |     799 |
|   1 |  SORT ORDER BY                 |         |      1 |      1 |   206   (2)|   2800 |00:00:00.23 |     799 |
|   2 |   NESTED LOOPS                 |         |      1 |      1 |   205   (1)|   2800 |00:00:00.21 |     799 |
|   3 |    NESTED LOOPS                |         |      1 |      1 |   205   (1)|   2800 |00:00:00.15 |     560 |
|*  4 |     TABLE ACCESS BY INDEX ROWID| T1      |      1 |      1 |   202   (1)|   2800 |00:00:00.09 |     200 |
|*  5 |      INDEX RANGE SCAN          | T1_DATE |      1 |  14013 |    41   (3)|  14000 |00:00:00.04 |      40 |
|*  6 |     INDEX RANGE SCAN           | T2_I1   |   2800 |      1 |     2   (0)|   2800 |00:00:00.03 |     360 |
|   7 |    TABLE ACCESS BY INDEX ROWID | T2      |   2800 |      1 |     3   (0)|   2800 |00:00:00.02 |     239 |
-----------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
   4 - filter("T1"."SUPP_ID"=1)
   5 - access("T1"."DATE_ORD">=TRUNC(SYSDATE@!)-14)
   6 - access("T2"."ID"="T1"."ID")

Note that I’ve included “+cost” as an extra format parameter in the call to display_cursor(); conveniently this shows me where the optimizer thinks the resource costs are going to be – that little bit of information is very useful as an adjunct to the E-Rows information. I’ve also used the “iostats” option rather than the “allstats” option – I wouldn’t normally do this, but it’s a way to eliminate the memory statistics, which would have made the report much wider while adding only a little extra information: namely that the “sort order by” at operation 1 needed (and got) about 100KB of memory to complete as an optimal workarea operation.
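
For reference, here is a minimal sketch of the two ways of enabling rowsource execution statistics – the gather_plan_statistics hint is the statement-level alternative to changing the session parameter:

```sql
-- Session level: collect rowsource execution statistics for every statement
alter session set statistics_level = all;

-- Statement level: hint an individual query instead of changing the session
select  /*+ gather_plan_statistics */
        t1.id, t2.small_vc
from    t1, t2
where   t1.date_ord >= trunc(sysdate) - 14
and     t1.supp_id  = 1
and     t2.id       = t1.id
order by
        t1.id
;
```

The hint is usually the better choice on a busy production system since the session-level setting adds the collection overhead to every statement the session runs.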

Before we even begin to do a comparison of estimates and actuals, we can note something that stands out as an efficiency threat in this example: the E-Rows in lines 5 and 4 tell us that Oracle expects to acquire 14,000 rowids from index t1_date and discard all but one of the rows after using those rowids to visit table t1 – and most of the predicted cost of the query (viz: 202 – 41) comes from the work done visiting that table. The optimizer seems to think that it has to pick a very poor quality index as the best way of addressing this query. Looking at the access() predicate on line 5 and the filter() predicate on line 4, we might decide that (a) we need to extend the index that seems to exist on just (date_ord) so that it becomes (date_ord, supp_id), or (b) if there’s an index already on (supp_id) we need to extend it to (supp_id, date_ord), or maybe (c) create an index on (supp_id) – with or without date_ord – because it’s missing and ought to exist.
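
As a sketch of those three choices (the index names here are invented for illustration), the DDL might look like this:

```sql
-- (a) extend the existing date-based index to include the filter column
drop index t1_date;
create index t1_date_supp on t1(date_ord, supp_id);

-- (b) if an index on (supp_id) already exists, extend it instead
create index t1_supp_date on t1(supp_id, date_ord);

-- (c) minimal option: create the missing index on just (supp_id)
create index t1_supp on t1(supp_id);
```

Options (a) and (b) both allow the optimizer to apply both predicates before visiting the table; (b) has the advantage that the equality predicate leads the index, keeping the range scan tightly bounded.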

Having noted that we may not have the right indexes, we can still go through the process of comparing actuals and estimates, as this may help us towards an informed decision about the infrastructure. (Sometimes, of course, you may decide that there’s no point in chasing the execution statistics further until the indexing is corrected.)

The order of operation (i.e. the order in which rowsources are produced) for the plan is: 5, 4, 6, 3, 7, 2, 1 – so we’ll consider the plan lines in that order, remembering that the E-Rows figure must be multiplied by the Starts figure before you compare it to the A-Rows figure:

5 – index range scan – estimated rows 14,013, starts 1; actual rows 14,000. Good prediction.
4 – table access by rowid – estimated rows 1, starts 1; actual rows 2,800. Bad prediction.

There are two reasons to worry about line 4: first, why is the prediction so bad? Secondly, will the prediction for this line result in a bad choice of execution plan (e.g. the wrong type of join from this point onwards, or even the wrong join order, because of this extreme low-volume/high-cost prediction)?

6 – index range scan – estimated rows 1, starts 2,800; actual rows 2,800. Good prediction!

How did we get back to a good prediction after the terrible prediction for operation 4? The answer to that question is another question – “why not?” Although we have to remember to multiply E-Rows by Starts before comparing it to A-Rows, it’s not necessarily the case that a good match is a good thing. The line-by-line comparison between actuals and estimates is a local comparison, and a good local prediction isn’t automatically indicative of a good global strategy.

In this case, if we access table t2 by the index on id we really will get one rowid and one row per access with very few buffer visits – and that’s what the prediction says, so it really is a good prediction. But the fact that we will have to execute (start) that operation 2,800 times is (probably) a bad thing.

We don’t have a column called “E-Starts” in the output, but if we did it would hold the value 1 (the value of E-Rows from the driving table of the nested loop join) and that would help us recognize where the important error in the prediction took place. When we look at the A-Rows column of a plan we need to know whether the volume of data comes from the number of rows per execution, or the number of executions.

3 – nested loop – estimated rows 1, starts 1; actual rows 2,800. Bad prediction.
7 – table access by rowid – estimated rows 1, starts 2,800; actual rows 2,800. Good (local) prediction again.
2 – nested loop – estimated rows 1, starts 1; actual rows 2,800. Bad prediction.
1 – Sort order by – estimated rows 1, starts 1; actual rows 2,800. Bad prediction.

To summarize the state of this query plan: the E-Rows for lines 5 and 4 suggest that the optimizer knew it didn’t have a high-precision option for accessing table t1. However, when we consider the E-Rows and A-Rows for line 4 we see that the optimizer had a bad idea of what the data in table t1 looked like, and the entire plan from that point onwards was (probably) inappropriate because of that initial estimate. Ignoring the initial error, every other step (join order and access method) that the optimizer dictated from that point on behaved as the optimizer expected – each of the later nested loop joins used a high-precision index to identify and return a small volume of data per execution – but by that point we were already locked into something that was potentially a bad plan. This leads us to two general points about using execution plans with rowsource execution statistics as a tool for highlighting the root causes of performance problems.

  • At any point in a plan, significant differences between actuals and estimates may point to flaws in our infrastructure, such as poor choice of index definitions or problems with defining or collecting statistics.
  • The earliest point in an execution plan where the estimated row count differs significantly from the actual row count is probably the point at which the optimizer’s choice of execution plan has gone wrong, and we should investigate the cause of that error first.

The problem in this particular case was that there was a significant skew in the data. There were several hundred supplier IDs in table t1, but a large fraction of the rows were supplied by a very small subset of the suppliers. Since I had not created a histogram on column supp_id the optimizer had effectively estimated the number of rows in table t1 as (num_rows / supp_id.num_distinct), but supplier 1 was one of my high-volume suppliers, with 2,800 items supplied in the last two weeks.
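
A sketch of one possible response – gathering a histogram on the skewed column so the optimizer can see that supplier 1 is special (the parameter values here are illustrative, not a recommendation):

```sql
begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          => 'T1',
                method_opt       => 'for columns supp_id size 254',
                estimate_percent => 100
        );
end;
/
```

With a frequency histogram in place the optimizer can estimate the row count for supp_id = 1 from the recorded bucket counts instead of assuming a uniform num_rows / num_distinct spread – though, as the next paragraph notes, a histogram is not automatically the right answer for every skew problem.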

There is no trivial, generic solution to this class of problem; the best strategy is a compromise that depends on too many variables, so I won’t go into details of the ways you could address it. As a side-note to the main topic, though, I will just mention that since I wasn’t using bind variables in this query, and since there were two predicates on table t1, and since I was using Oracle 11.2.0.4, the optimizer marked the initial cursor with is_shareable = ‘N’ and use_feedback_stats = ‘Y’ and re-optimized the statement the second time I ran it, using an internal set of opt_estimate() hints reflecting the actual run-time statistics produced by the first execution. This produced a different execution plan which included a hash join into t2 (still using the same date index into t1).
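
If you want to check whether this cardinality feedback mechanism has fired for a statement, a query along the following lines is a reasonable sketch (to the best of my recollection the shareability flag is exposed in v$sql and the feedback flag in v$sql_shared_cursor; substitute your own sql_id):

```sql
select
        s.sql_id,
        s.child_number,
        s.is_shareable,
        c.use_feedback_stats
from
        v$sql               s,
        v$sql_shared_cursor c
where
        s.sql_id       = '&m_sql_id'
and     c.sql_id       = s.sql_id
and     c.child_number = s.child_number
;
```

A child cursor with use_feedback_stats = ‘Y’ is the one that was marked for re-optimization; after the second execution you would expect to see a new child holding the revised plan.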

Scalar Subquery Caching

Another class of problem we can run into where rowsource execution statistics can be helpful is in the area of scalar subqueries, whether in the select list or in the where clause. In part 10 (Guesswork) of this series I introduced a variant of the following query and its execution plan:

select
        count(*)
from    (
        select  /*+ no_merge */
                outer.*
        from
                emp outer
        where
                outer.sal > (
                        select  /*+ no_unnest */
                                avg(inner.sal)
                        from
                                emp inner
                        where
                                inner.dept_no = outer.dept_no
                )
        )
;

--------------------------------------------------------------
| Id  | Operation             | Name | Rows  | Bytes | Cost  |
--------------------------------------------------------------
|   0 | SELECT STATEMENT      |      |     1 |       |   245 |
|   1 |  SORT AGGREGATE       |      |     1 |       |       |
|   2 |   VIEW                |      |   167 |       |   245 |
|*  3 |    FILTER             |      |       |       |       |
|   4 |     TABLE ACCESS FULL | EMP  | 20000 |   156K|    35 |
|   5 |     SORT AGGREGATE    |      |     1 |     8 |       |
|*  6 |      TABLE ACCESS FULL| EMP  |  3333 | 26664 |    35 |
--------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - filter("OUTER"."SAL"> (SELECT /*+ NO_UNNEST */
              AVG("INNER"."SAL") FROM "EMP" "INNER" WHERE "INNER"."DEPT_NO"=:B1))
   6 - filter("INNER"."DEPT_NO"=:B1)

The emp table lists 20,000 employees spread over 6 departments. The variant of the query I used in part 10 didn’t reduce the result set to an inline no_merge() view with a simple count(*) as this code does, but the key features of the query are unchanged. The question, of course, was: how many times does the correlated (scalar) subquery run? The answer, in principle, was “any number you like between 6 (the number of departments) and 20,000 (the number of employees)”. If you check the Cost figures you will see that Oracle has based its arithmetic on a model that expects to execute the subquery 6 times: 245 = 35 (tablescan at operation 4) + 6 * 35 (tablescan at operation 6).

So when you add a couple of new departments to the company, fire a few people, hire a few people, and move a few employees into the new departments what happens? For a very small change in data volume the performance of the query could change dramatically. Initially this query completed in 0.35 seconds; after changing the dept_no (to carefully selected values) for a couple of (carefully selected) employees the run time jumped to a little over 3 minutes – all CPU. Here’s the execution plan, with rowsource execution statistics after the change:

----------------------------------------------------------------------------------------
| Id  | Operation             | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
----------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT      |      |      1 |        |      1 |00:03:09.75 |    1467K|
|   1 |  SORT AGGREGATE       |      |      1 |      1 |      1 |00:03:09.75 |    1467K|
|   2 |   VIEW                |      |      1 |    167 |   9997 |00:03:09.71 |    1467K|
|*  3 |    FILTER             |      |      1 |        |   9997 |00:03:09.63 |    1467K|
|   4 |     TABLE ACCESS FULL | EMP  |      1 |  20000 |  20000 |00:00:00.08 |     220 |
|   5 |     SORT AGGREGATE    |      |   6671 |      1 |   6671 |00:03:09.31 |    1467K|
|*  6 |      TABLE ACCESS FULL| EMP  |   6671 |   3333 |     22M|00:01:38.37 |    1467K|
----------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   3 - filter("OUTER"."SAL">)
   6 - filter("INNER"."DEPT_NO"=:B1)

As with the nested loop join in the first example we really could do with two “Starts” columns, one for A-Starts (actuals) and one for E-Starts (estimated), but all we have is the actuals. Despite the fact that the optimizer modelled for 6 starts of the subquery, it actually ran it 6,671 times because the run-time engine was unable to keep all the required subquery results cached.

In this case, the query was artificially constructed to demonstrate a particular class of the scalar subquery problem – the correct solution would be to stop blocking the subquery unnesting – but in real-life this type of behaviour does occur and it’s nice to know that we can determine exactly where the problem appears as it tells us something about how we need to rewrite (or get Oracle to rewrite) the query.
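
If we simply removed the no_unnest hint the optimizer could do this transformation for us; written out by hand, the unnested equivalent is roughly the following sketch (the alias and column names in the aggregate view are invented):

```sql
select
        count(*)
from    (
        select  /*+ no_merge */
                outer.*
        from
                emp     outer,
                (
                select
                        dept_no, avg(sal) avg_sal
                from
                        emp
                group by
                        dept_no
                )       dept_avg
        where
                dept_avg.dept_no = outer.dept_no
        and     outer.sal        > dept_avg.avg_sal
        )
;
```

Instead of running the correlated subquery once per distinct value that escapes the cache, this version aggregates emp exactly once and then joins – typically with a hash join – so the amount of work no longer depends on the run-time engine’s caching behaviour.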

The same type of issue (perhaps more prevalent nowadays) appears with scalar subqueries in the select list. If you have a query with multiple scalar subqueries in the select list how do you find out where the time is going when the query takes too long to complete? It’s likely that you can get a lot of help from rowsource execution statistics. Based on the same table, here’s a query, with its execution plan and run-time stats demonstrating the point:

select
        dept_no,
        emp_no,
        (select sum(sal) from emp e2 where e2.dept_no = e1.dept_no) sal_tot
from
        emp e1
order by
        dept_no, emp_no
;

----------------------------------------------------------------------------------------------------------------
| Id  | Operation          | Name | Starts | E-Rows | A-Rows |   A-Time   | Buffers |  OMem |  1Mem | Used-Mem |
----------------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT   |      |      1 |        |  20000 |00:00:04.71 |     731K|       |       |          |
|   1 |  SORT AGGREGATE    |      |   3338 |      1 |   3338 |00:00:04.59 |     731K|       |       |          |
|*  2 |   TABLE ACCESS FULL| EMP  |   3338 |   3333 |     11M|00:01:06.77 |     731K|       |       |          |
|   3 |  SORT ORDER BY     |      |      1 |  20000 |  20000 |00:00:04.71 |     731K|   832K|   511K|  739K (0)|
|   4 |   TABLE ACCESS FULL| EMP  |      1 |  20000 |  20000 |00:00:00.08 |     219 |       |       |          |
----------------------------------------------------------------------------------------------------------------

With one unlucky combination of departments and employees the inline scalar subquery ran 3,338 times instead of once per department – which is a clue that this query should probably have used a non-mergeable aggregate view (or materialized factored subquery) with a hash join.
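
As a sketch of that rewrite (the factored subquery name and its column names are invented), the materialized-factored-subquery version would look something like this:

```sql
with dept_tot as (
        select  /*+ materialize */
                dept_no, sum(sal) sal_tot
        from
                emp
        group by
                dept_no
)
select
        e1.dept_no,
        e1.emp_no,
        dt.sal_tot
from
        emp      e1,
        dept_tot dt
where
        dt.dept_no = e1.dept_no
order by
        e1.dept_no, e1.emp_no
;
```

Here emp is aggregated exactly once, and the join back to the (tiny) result can be a hash join – so the total work is two scans of emp plus a join, rather than a number of scans that depends on how lucky the scalar subquery cache gets.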

Summary

We have seen that rowsource execution statistics can tell you how many times an operation is run, how much work it actually does in total, and whether the optimizer’s estimate of how much data it will return is realistic or not – remembering particularly that we should be comparing A-Rows with E-Rows * Starts. We have also noted that this final bit of information is “local” and that an accurate local estimate isn’t necessarily indicative of a good “global” strategy.

There are a couple of places where we can apply global thinking to the rowsource execution stats, though. The first is in looking for the earliest place in an execution plan where the actual number of rows in a driving rowsource (typically the first rowsource of a join) differs from the estimated number by a significant factor – this is likely to lead to subsequent poor choices of join method, or even be the source of a bad initial join order. The second is finding places where a large fraction of the work of the query comes from scalar subqueries being executed far more frequently than we might hope or expect – and without some form of the rowsource execution stats it isn’t generally possible to know how many times each scalar subquery will have to be executed.

