Dynamic Sampling (III) – Real-Life Data – Part II

In the previous installment of this series we saw that Dynamic Sampling wasn’t of great help when dealing with certain types of data patterns. We tried to add a suitable index to improve the Dynamic Sampling results, but no luck so far.

The solution in this case is to add the DYNAMIC_SAMPLING_EST_CDN hint: with statistics in place, the index only gets used when that hint is specified (see the previous parts of the series for an explanation of the hint). I'm not sure why this is, since in my understanding the hint is only meant to improve the selectivity estimate and not to override existing cardinality/blocks statistics, but that's the way the code works at present.
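As a minimal sketch of such a hinted query (assuming a PARCELS table with the DELIVERY_STATUS and PICKUP_DATE columns, and purely illustrative literal values and sampling level), it could look like this:

-- illustrative sketch only: table name, literals and sampling level are assumptions
select /*+ dynamic_sampling(p 4) dynamic_sampling_est_cdn(p) */
       count(*)
from   parcels p
where  delivery_status = 'OPEN'
and    pickup_date >= trunc(sysdate) - 1;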

Re-running the experiment now looks like this:

And the results:

And the trace file shows:

That looks very promising – with a very small overhead we get a spot-on cardinality estimate. Note that by small overhead I mean the sampling overhead; of course there is the additional overhead of maintaining the index if it was added only for that specific purpose.

If you check the optimizer trace file carefully, you'll notice two recursive queries issued by the Dynamic Sampling code. The first one corresponds to those we've already seen, but the second one is different: it explicitly uses the index PARCELS_IDX, doesn't use the SAMPLE clause, and restricts the result set to the first 2,500 rows.
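Stripped of the internal hints and bookkeeping that the recursive SQL actually contains, this second query conceptually boils down to something along these lines (a simplified sketch using the assumed PARCELS table and predicates):

-- simplified sketch, not the literal recursive SQL: count matching rows
-- via PARCELS_IDX, but stop after the first 2,500 rows
select /*+ index(p parcels_idx) */
       count(*)
from   parcels p
where  delivery_status = 'OPEN'
and    pickup_date >= trunc(sysdate) - 1
and    rownum <= 2500;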

The idea seems to be that if Oracle finds a suitable index and decides to use it (more on that in a moment), it can very cheaply obtain a precise selectivity / cardinality estimate based on the index in the case of rare values (which here means fewer than 2,500 occurrences).

What happens if the index query identifies 2,500 rows (or more)? In that case Oracle doesn't know exactly how many rows would qualify in total, and it uses the greater of the two cardinalities to arrive at the estimate. This means that, depending on the data pattern and the results of the two queries, you might still end up with a pretty incorrect estimate, but it might at least be better than the result of the single query used by default.

Note: If you happen to use partitioned tables there are a number of nasty bugs related to Dynamic Sampling in versions prior to 11.2.0.2 / 11.2.0.3 respectively. In particular the cardinality estimate based on the additional index-based query can be completely misleading, for more information see this blog post.

Unfortunately the code is at present pretty picky when deciding whether to use indexes or not.

The first point is that, for multi-column indexes to be used, all predicates need to be index access predicates: the index won't be used if at least one of the predicates can only be applied as a filter on the index.

For example, if the above index had been created with the two columns in a different order (PICKUP_DATE, DELIVERY_STATUS), it wouldn't be used, since DELIVERY_STATUS could no longer act as an index access predicate due to the range comparison on the leading PICKUP_DATE column (this follows from the resulting order of the index entries).
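To illustrate with the assumed PARCELS table (the two statements show alternative definitions of the same index, for comparison only):

-- usable: equality on the leading column, range on the trailing column,
-- so both predicates are access predicates
create index parcels_idx on parcels (delivery_status, pickup_date);

-- not usable for the index-based sampling query: the range predicate on the
-- leading column PICKUP_DATE turns DELIVERY_STATUS into a filter predicate
create index parcels_idx on parcels (pickup_date, delivery_status);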

I can understand this decision to a certain extent, as otherwise it becomes unpredictable how many index blocks effectively have to be read to scan for 2,500 rows in the index.

This also means that certain kinds of complex predicates cannot be covered by Dynamic Sampling using indexes, since it is not possible to use all three predicates as access predicates no matter how the index is defined. A sketch of such a combination of three predicates follows below.
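As a hypothetical illustration (the WEIGHT column and the literal values are made up for this sketch): an equality plus two range predicates on three different columns, of which any single B-tree index can use at most two as access predicates:

select count(*)
from   parcels
where  delivery_status = 'OPEN'              -- equality: can lead an index
and    pickup_date >= trunc(sysdate) - 1     -- range: can follow the equality column as access predicate
and    weight <= 5;                          -- second range: remains a filter predicate whatever the index order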

What I can’t understand is the fact that adding another index that starts with one of the columns used as predicates disables the usage of the indexes completely.

For example, if I add a second index like the following sketch (index name and column choice are illustrative):
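-- illustrative sketch: a second index that starts with one of the columns
-- used in the predicates (name and column choice are assumptions)
create index parcels_idx2 on parcels (pickup_date);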

this is the result I get for the previous, unchanged test case:

So we are back to square one. And this is what the optimizer trace file tells us:

So both indexes were recognized, but none of them were used to run the index-based query. I believe this could be classified as a bug, but all versions up to the latest one available at the time of writing this (11.2.0.3) show this behaviour.

Note: You'll notice that using the DYNAMIC_SAMPLING_EST_CDN hint seems to implicitly override the sanity checks – the Dynamic Sampling results get used even when they are questionable. This is something to keep in mind when using the approach outlined here.

Beyond Dynamic Sampling And Indexes

If you happen to have cases that you can cover neither with Dynamic Sampling (potentially supported by indexes) nor with Extended Statistics on Column Groups, then you might still be able to get improved cardinality estimates by using suitable Virtual Columns or Extended Statistics expressions.

This is, however, only applicable in cases where you can change the query the application submits, and where you know the expressions upfront.

For our example here, assuming the index option is not feasible for whatever reason, you could try to add a Virtual Column or Extended Statistics expression along the following lines (only from 11g on, pre-11g would need a corresponding Function-Based Index; the expression shown is just an illustrative sketch):
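-- illustrative sketch only: a virtual column that captures the combined predicate;
-- the SYSDATE reference makes this fail (see below)
alter table parcels add (
  recent_open_flag as (
    case
      when delivery_status = 'OPEN'
      and  pickup_date >= sysdate - 1
      then 1
    end
  )
);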

only to find out that you get an error “only pure functions can be specified in a virtual column expression”, due to the SYSDATE expression.

Assuming a suitable expression, just as an example here (adjust the literal expression as necessary):
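-- same idea, but with a (deterministic) date literal instead of SYSDATE;
-- the column name and the literal are arbitrary example values
alter table parcels add (
  recent_open_flag as (
    case
      when delivery_status = 'OPEN'
      and  pickup_date >= date '2012-01-01'
      then 1
    end
  )
);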

you can add such a virtual column and then gather statistics on it:
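-- gather statistics on the new (assumed) virtual column RECENT_OPEN_FLAG
begin
  dbms_stats.gather_table_stats(
    ownname    => user,
    tabname    => 'PARCELS',
    method_opt => 'for columns recent_open_flag size 1'
  );
end;
/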

Note: It's very important to remember that Extended Statistics and Virtual Columns (including the Virtual Columns added implicitly via Function-Based Indexes) need to have statistics gathered on them, otherwise the optimizer can't leverage the added value of such expressions. Although default METHOD_OPT options like "FOR ALL COLUMNS SIZE AUTO" will include such additional columns the next time statistics are gathered, it is important to explicitly gather statistics right after adding such columns if you want to take full advantage of them.

You could run a query like the following, without the need for Dynamic Sampling, indexes or even histograms:
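-- the query rewritten against the virtual column (names and literals follow
-- the assumed sketches above)
select count(*)
from   parcels
where  recent_open_flag = 1;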

So the cardinality estimate is spot-on, but you need to know the expressions upfront, they need to be static, and the expression needs to be supported by the Virtual Column code. Of course this approach could be refined further to cover more complex expressions, but then again histograms might be required to handle the virtual column data properly.

Summary

Dynamic Sampling can be very useful, but if you’re looking for rare values, or if the data you’re looking for is not uniformly distributed across the table but clustered together, the default Dynamic Sampling might fail to produce helpful results.

Sometimes you might be able to get Dynamic Sampling working by adding a suitable index, but you need to check carefully whether the Dynamic Sampling code actually decides to use the index, and if statistics exist you need to explicitly specify the DYNAMIC_SAMPLING_EST_CDN hint for the table in question, which is a pity, because it makes generic implementations more complex.

In other cases, where you can influence the query issued, you might be able to overcome such issues by using corresponding Virtual Column expressions, provided these are static in nature and supported by the Virtual Column code.

The purpose of the whole exercise is to arrive at more reasonable cardinality estimates, based on the knowledge that incorrect cardinality estimates are very often the root cause of inefficient execution plans. (If you want to learn more about why this is so important, you could watch my Webinar recording on Cost-Based Optimizer Basics, available here.)

Go to Dynamic Sampling (III) – Real-Life Data – Part I