In the previous installment of this series we saw that Dynamic Sampling wasn’t of great help when dealing with certain types of data patterns. We tried to add a suitable index to improve the Dynamic Sampling results, but no luck so far.

The solution in this case is that with statistics in place the index only gets used when adding the DYNAMIC_SAMPLING_EST_CDN hint (see the previous parts of the series for an explanation of that hint). I’m not sure why this is as in my understanding it is only used to improve the selectivity estimate and not to overwrite existing cardinality/blocks statistics, but that’s the way the code at present works.

Re-running the experiment now looks like this:

set echo on

alter session set tracefile_identifier = 'dyn_samp';

alter session set optimizer_dynamic_sampling = 4;

alter session set events '10053 trace name context forever, level 1';

explain plan for
select  /*+ DYNAMIC_SAMPLING_EST_CDN(pcl) */
        *
from
        parcels pcl
where
        pcl.pickup_date >= sysdate - 1
and     pcl.delivery_status = 'In-transit'
;

select * from table(dbms_xplan.display);

And the results:

-------------------------------------------------------------------------------------------
| Id  | Operation                   | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |             |  1000 |   180K|    42   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| PARCELS     |  1000 |   180K|    42   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | PARCELS_IDX |  1282 |       |     8   (0)| 00:00:01 |
-------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("PCL"."DELIVERY_STATUS"='In-transit' AND "PCL"."PICKUP_DATE">=SYSDATE@!-1 AND "PCL"."PICKUP_DATE" IS NOT NULL)

Note
-----
   - dynamic sampling used for this statement (level=4)

And the trace file shows:

** Dynamic sampling initial checks returning TRUE (level = 4).
** Dynamic sampling updated index stats.: PARCELS_IDX, blocks=4065
** Dynamic sampling index access candidate : PARCELS_IDX
** Dynamic sampling updated table stats.: blocks=26518

*** 2012-05-29 19:00:40.282
** Generated dynamic sampling query:
query text :
SELECT /* OPT_DYN_SAMP */ /*+ ALL_ROWS IGNORE_WHERE_CLAUSE NO_PARALLEL(SAMPLESUB) opt_param('parallel_execution_enabled', 'false') NO_PARALLEL_INDEX(SAMPLESUB) NO_SQL_TUNE */ NVL(SUM(C1),0), NVL(SUM(C2),0) FROM (SELECT /*+ IGNORE_WHERE_CLAUSE NO_PARALLEL("PCL") FULL("PCL") NO_PARALLEL_INDEX("PCL") */ 1 AS C1, CASE WHEN "PCL"."DELIVERY_STATUS"='In-transit' AND "PCL"."PICKUP_DATE">=SYSDATE@!-1 THEN 1 ELSE 0 END AS C2 FROM "PARCELS" SAMPLE BLOCK (0.116902 , 1) SEED (1) "PCL") SAMPLESUB

*** 2012-05-29 19:00:40.297
** Executed dynamic sampling query:
level : 4
sample pct. : 0.116902
actual sample size : 912
filtered sample card. : 0
orig. card. : 1000000
block cnt. table stat. : 26518
block cnt. for sampling: 26518
max. sample block cnt. : 32
sample block cnt. : 31
min. sel. est. : 0.00000102
** Using recursive dynamic sampling card. est. : 780142.451613

*** 2012-05-29 19:00:40.297
** Generated dynamic sampling query:
query text :
SELECT /* OPT_DYN_SAMP */ /*+ ALL_ROWS opt_param('parallel_execution_enabled', 'false') NO_PARALLEL(SAMPLESUB) NO_PARALLEL_INDEX(SAMPLESUB) NO_SQL_TUNE */ NVL(SUM(C1),0), NVL(SUM(C2),0), NVL(SUM(C3),0) FROM (SELECT /*+ NO_PARALLEL("PCL") INDEX("PCL" PARCELS_IDX) NO_PARALLEL_INDEX("PCL") */ 1 AS C1, 1 AS C2, 1 AS C3  FROM "PARCELS" "PCL" WHERE "PCL"."DELIVERY_STATUS"='In-transit' AND "PCL"."PICKUP_DATE">=SYSDATE@!-1 AND ROWNUM <= 2500) SAMPLESUB

*** 2012-05-29 19:00:40.297
** Executed dynamic sampling query:
level : 4
sample pct. : 100.000000
actual sample size : 780142
filtered sample card. : 1000
filtered sample card. (index PARCELS_IDX): 1000
orig. card. : 780142
block cnt. table stat. : 26518
block cnt. for sampling: 26518
max. sample block cnt. : 4294967295
sample block cnt. : 26518
min. sel. est. : 0.00000102
index PARCELS_IDX selectivity est.: 0.00128182
** Using dynamic sampling card. : 780142 ** Using single table dynamic sel. est. : 0.00128182
Table: PARCELS  Alias: PCL
Card: Original: 780142.451613  Rounded: 1000  Computed: 1000.00  Non Adjusted: 1000.00

That looks very promising – with a very small overhead we get a spot on cardinality estimate. Note that by small overhead I mean the sampling overhead, of course there is the additional overhead of maintaining the index if it was added only for that specific purpose.

If you check the optimizer trace file carefully, you’ll notice two recursive queries issued by the Dynamic Sampling code. The first one corresponds to those we’ve already seen, but the second one is different: It explicitly uses the index PARCELS_IDX; doesn’t use the SAMPLE clause but restricts the result set to the first 2,500 rows.

The idea seems to be that if Oracle finds a suitable index and decides to use it (more on that in a moment) it can very cheaply check for a very precise selectivity / cardinality estimate in case of rare values (which in this case means less than 2,500 occurrences) based on the index.

What happens if the index query identifies 2,500 rows (or more)? In that case Oracle doesn’t know exactly how many rows are identified and it uses the greater cardinality of the two queries to arrive at the cardinality estimate. This means that depending on the data pattern and the results of the two queries you might end up with a pretty incorrect estimate, but it might at least be better than the result of the single query used by default.

Note: If you happen to use partitioned tables there are a number of nasty bugs related to Dynamic Sampling in versions prior to 11.2.0.2 / 11.2.0.3 respectively. In particular the cardinality estimate based on the additional index-based query can be completely misleading, for more information see this blog post.

Unfortunately the code is at present pretty picky when deciding whether to use indexes or not.

The first point is that for multi-column indexes to be used all predicates need to be index access predicates, which means the index won’t be used if at least one of the predicates can only be applied by filtering the index.

For example, if the above index was created using the two columns in a different order (PICKUP_DATE, DELIVERY_STATUS), then it won’t be used since DELIVERY_STATUS cannot be used as an index access predicate due to the range comparison on PICKUP_DATE (this is implied by the different order of the index entries).

I can understand this decision to a certain extent as otherwise it becomes unpredictable how many index blocks have to effectively be read to scan for 2,500 rows in the index.

This also means that certain kinds of complex predicates, like for example:

pickup_date >= sysdate - 1 and some_other_date >= sysdate - 5 and delivery_status = 'In-transit'

cannot be covered by Dynamic Sampling using indexes, since it is not possible to use all three predicates as access predicates no matter how the index is defined.

What I can’t understand is the fact that adding another index that starts with one of the columns used as predicates disables the usage of the indexes completely.

For example, if I add this index:

create index parcels_idx2 on parcels (delivery_status);

this is the result for the previous, unchanged test case I get:

-------------------------------------------------------------------------------------------
| Id  | Operation                   | Name        | Rows  | Bytes | Cost (%CPU)| Time     |
-------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT            |             |     1 |   193 |     4   (0)| 00:00:01 |
|   1 |  TABLE ACCESS BY INDEX ROWID| PARCELS     |     1 |   193 |     4   (0)| 00:00:01 |
|*  2 |   INDEX RANGE SCAN          | PARCELS_IDX |     1 |       |     3   (0)| 00:00:01 |
-------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   2 - access("PCL"."DELIVERY_STATUS"='In-transit' AND "PCL"."PICKUP_DATE">=SYSDATE@!-1 AND "PCL"."PICKUP_DATE" IS NOT NULL)

Note
-----
   - dynamic sampling used for this statement (level=4)

So we are back to square one. And this is what the optimizer trace file tells us:

** Dynamic sampling initial checks returning TRUE (level = 4).
** Dynamic sampling updated index stats.: PARCELS_IDX, blocks=4066
** Dynamic sampling updated index stats.: PARCELS_IDX2, blocks=2934
** Dynamic sampling updated table stats.: blocks=27984

*** 2012-05-29 19:20:54.473
** Generated dynamic sampling query:
query text :
SELECT /* OPT_DYN_SAMP */ /*+ ALL_ROWS IGNORE_WHERE_CLAUSE NO_PARALLEL(SAMPLESUB) opt_param('parallel_execution_enabled', 'false') NO_PARALLEL_INDEX(SAMPLESUB) NO_SQL_TUNE */ NVL(SUM(C1),0), NVL(SUM(C2),0), NVL(SUM(C3),0), NVL(SUM(C4),0) FROM (SELECT /*+ IGNORE_WHERE_CLAUSE NO_PARALLEL("PCL") FULL("PCL") NO_PARALLEL_INDEX("PCL") */ 1 AS C1, CASE WHEN "PCL"."DELIVERY_STATUS"='In-transit' AND "PCL"."PICKUP_DATE">=SYSDATE@!-1 THEN 1 ELSE 0 END AS C2, CASE WHEN "PCL"."DELIVERY_STATUS"='In-transit' THEN 1 ELSE 0 END AS C3, CASE WHEN "PCL"."PICKUP_DATE">=SYSDATE@!-1 AND "PCL"."DELIVERY_STATUS"='In-transit' THEN 1 ELSE 0 END AS C4 FROM "PARCELS" SAMPLE BLOCK (0.110778 , 1) SEED (1) "PCL") SAMPLESUB

*** 2012-05-29 19:20:54.488
** Executed dynamic sampling query:
level : 4
sample pct. : 0.110778
actual sample size : 1260
filtered sample card. : 0
filtered sample card. (index PARCELS_IDX2): 0
filtered sample card. (index PARCELS_IDX): 0
orig. card. : 1000000
block cnt. table stat. : 27984
block cnt. for sampling: 27984
max. sample block cnt. : 32
sample block cnt. : 31
min. sel. est. : 0.00000064
** Using dynamic sampling card. : 1137414

Table: PARCELS  Alias: PCL
Card: Original: 1137414.193548  Rounded: 1  Computed: 0.73  Non Adjusted: 0.73

So both indexes were recognized, but none of them were used to run the index-based query. I believe this could be classified as a bug, but all versions up to the latest one available at the time of writing this (11.2.0.3) show this behaviour.

Note: You’ll notice that using the DYNAMIC_SAMPLING_EST_CDN hint seems to implicitly override the sanity checks – the results of the Dynamic Sampling get used even with questionable results. This is something to keep in mind when using the approach outlined here.

Beyond Dynamic Sampling And Indexes

If you happen to have such cases that you can’t cover with either Dynamic Sampling, potentially supported by indexes, nor Extended Statistics on Column Groups, then you might be able to get improved cardinality estimates by using suitable Virtual Columns or Extended Statistics expressions.

This is, however, only applicable in cases where you can change the query the application submits, and where you know the expressions upfront.

For our example here, assuming the index option is not feasible for whatever reason, you could try to add the following Virtual Column or Extended Statistics expression (only from 11g on, pre-11g would need a corresponding Function-Based Index):

alter table parcels add (rare_vals as (cast(case when pickup_date >= sysdate - 1 and delivery_status = 'In-transit' then 'Y' else null end) as varchar2(1));

only to find out that you get an error “only pure functions can be specified in a virtual column expression”, due to the SYSDATE expression.

Assuming a suitable expression, just as an example here (adjust the literal expression as necessary):

alter table parcels add (rare_vals as (cast(case when pickup_date >= to_date('30.05.2012 09:13:16', 'DD.MM.YYYY HH24:MI:SS') and delivery_status = 'In-transit' then 'Y' else null end) as varchar2(1));

you can add such a virtual column and after gathering statistics on that column:

exec dbms_stats.gather_table_stats(null, 'parcels', method_opt => 'for columns size 1 rare_vals')

Note: It’s very important to remember that Extended Statistics, Virtual Columns (or the Virtual Columns added implicitly via Function-Based Indexes) need to have statistics gathered for, otherwise the optimizer can’t leverage the added value of such expressions. Although default METHOD_OPT options like “FOR ALL COLUMNS SIZE AUTO” will include such additional columns the next time statistics are gathered, it is important to explicitly gather statistics after you’ve just added such columns if you want to take full advantage of them

You could run a query like the following, without the need for Dynamic Sampling, indexes or even histograms:

explain plan for
select
        *
from
        parcels pcl
where
        rare_vals = 'Y'
;

select * from table(dbms_xplan.display);

-----------------------------------------------------------------------------
| Id  | Operation         | Name    | Rows  | Bytes | Cost (%CPU)| Time     |
-----------------------------------------------------------------------------
|   0 | SELECT STATEMENT  |         |  1000 |   189K|  7743   (3)| 00:01:33 |
|*  1 |  TABLE ACCESS FULL| PARCELS |  1000 |   189K|  7743   (3)| 00:01:33 |
-----------------------------------------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------

   1 - filter("RARE_VALS"='Y')

So the cardinality estimate is spot on, but you need to know the expressions upfront, they need to be static and the expression needs to be supported by the Virtual Column code. Of course this approach could be further refined to serve more complex expressions, but then again histograms might be required to handle the virtual column data statistics properly.

Summary

Dynamic Sampling can be very useful, but if you’re looking for rare values, or if the data you’re looking for is not uniformly distributed across the table but clustered together, the default Dynamic Sampling might fail to produce helpful results.

Sometimes you might be able to get Dynamic Sampling working by adding a suitable index, but you need to check carefully if the Dynamic Sampling code decides to actually use the index or not, and if statistics exist, you need to explicitly specify the DYNAMIC_SAMPLING_EST_CDN hint for the table in question, which is a pity, because it makes generic implementations more complex.

In other cases where you can influence the query issued you might able to overcome such issues by using corresponding Virtual Column expressions, if these are static in nature and supported by the Virtual Column code.

The purpose of the whole exercise is to arrive at more reasonable cardinality estimates – based on the knowledge that incorrect cardinality estimates are very often the root cause of inefficient execution plans (If you want to learn more why this is so important, you could watch my Webinar recording available here on Cost-Based Optimizer Basics).

Go to Dynamic Sampling (III) – Real-Life Data – Part I