Dynamic Sampling (III) – Real-Life Data – Part I

In the previous parts of the series I outlined the basics of Dynamic Sampling. Let’s see how Dynamic Sampling deals with typical real-life data patterns.

As you might be aware, real data quite often follows certain patterns in how it is clustered together, usually related to how the data arrives. This clustering (or scattering) of data within a table can be influenced by the physical organisation of the table, but in Oracle’s most commonly used structure, the Heap Organized Table, data can generally be stored in any table block and hence usually follows the natural pattern dictated by data arrival.

Furthermore, quite often the most interesting data is the latest data, and the data your queries are interested in is usually not spread randomly across the table but clustered together in a specific area of it.

If you think about how Dynamic Sampling works – randomly sampling usually only a very limited number of blocks (commonly somewhere between 32 and 256) – it becomes obvious that such clustered data patterns can cause problems: very likely the few blocks sampled won’t contain a representative subset of the data we are looking for.

This means that Dynamic Sampling works well if the data you’re looking for is spread randomly across a table, but by default has problems when the data you’re looking for is either clustered together or simply very rare. In both cases chances are high that the sampled blocks don’t contain a representative subset of the data you’re looking for, and the sanity checks built into the code will then kick in and reject the Dynamic Sampling results.

Let’s see a simple test case for how Dynamic Sampling deals with such a data pattern.

The Baseline

What I simulate here is a typical problem of correlated column values – I’m looking for the most recent data that is in a certain status that represents “unprocessed” data. Of course, it is very likely that the most recent data is the data that is yet to be processed, so these two questions basically ask for the same information, but in two different ways.

The example is based on the “parcels” example Jonathan Lewis used in his “Statistics” chapter of the book “Expert Oracle Practices” (used with permission).

There are 1,000 parcels per day for the last 1,000 days, and the last 1,000 parcels are marked as “In-transit”.
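The original setup script isn’t reproduced here, but a minimal sketch matching the description might look like the following. Note that the column name DELIVERY_DATE and the status literals are my assumptions – only the DELIVERY_STATUS column is named in the text:

```sql
-- Sketch of the test case: 1,000 parcels per day for 1,000 days,
-- the most recent 1,000 rows still "In-transit".
-- DELIVERY_DATE and the literals are assumed names.
create table parcels
as
select
       sysdate - 1000 + trunc((rownum - 1) / 1000)  as delivery_date
     , case
         when rownum > 999000 then 'IN-TRANSIT'
         else 'DELIVERED'
       end                                          as delivery_status
     , rpad('x', 100)                               as padding
from
       dual
connect by level <= 1000000;

-- Gather statistics including a histogram on the skewed status column,
-- as described in the text
exec dbms_stats.gather_table_stats(null, 'PARCELS', -
       method_opt => 'for all columns size 1 for columns delivery_status size 254')
```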

We are looking for a good cardinality estimate for the following query:
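The query itself is not shown here; based on the description it asks for the recent, still unprocessed parcels, so its shape is presumably along these lines (the date predicate and literals are assumed):

```sql
-- Assumed shape of the query: the latest data that is still unprocessed.
-- Both predicates identify the same 1,000 rows out of 1,000,000.
select  count(*)
from    parcels
where   delivery_status = 'IN-TRANSIT'
and     delivery_date   > sysdate - 1;
```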

So both filter predicates identify 1,000 rows each out of 1,000,000, and since I have generated a histogram on the highly skewed “delivery_status” column the optimizer is aware of this. Applying the predicates independently from each other shows that the optimizer’s estimates are in the right ballpark.

The problem, however, is that applying both together by default results in an estimate of 1 out of 1,000,000, since the optimizer doesn’t know about the correlation.

Since one of the predicates is a range (non-equality) comparison, the “Extended Statistics” feature introduced in 11g is of no help either: column-group statistics are only used when all filters on the column group are equality comparisons.

So we are left with Dynamic Sampling. At the session default level of 2 Dynamic Sampling won’t kick in. We need to either increase the session/cursor level (according to the level descriptions, level 4 or higher sounds reasonable for correlated column values), or we could ask for explicit table-level Dynamic Sampling. Remember that the number of blocks sampled is defined differently for the “session/cursor” and “table” level Dynamic Sampling (as explained in the previous parts of this series).
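As a reminder, the two flavours can be requested like this (the table and query are from this test case; the hint syntax itself is standard Oracle):

```sql
-- Cursor-level Dynamic Sampling: session setting (applies to all statements)
alter session set optimizer_dynamic_sampling = 4;

-- Cursor-level Dynamic Sampling: statement hint without a table reference
select /*+ dynamic_sampling(4) */ count(*) from parcels;

-- Table-level Dynamic Sampling: hint naming the table; note that the
-- level-to-blocks mapping differs from the cursor-level variant
select /*+ dynamic_sampling(parcels 4) */ count(*) from parcels;
```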

Let’s start with session level Dynamic Sampling set to level 4:
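The statements executed aren’t reproduced here; presumably the test ran something like this (query shape as assumed above):

```sql
alter session set optimizer_dynamic_sampling = 4;

explain plan for
select  count(*)
from    parcels
where   delivery_status = 'IN-TRANSIT'
and     delivery_date   > sysdate - 1;

select * from table(dbms_xplan.display);
```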

As expected, and as already outlined above, that didn’t really help. There is no note about Dynamic Sampling in the DBMS_XPLAN.DISPLAY output, although we would expect it to kick in under these circumstances.

Let’s have a look at the optimizer trace file:
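For reference, the optimizer trace referred to here is the CBO (10053) trace, which can be enabled for the session like this:

```sql
-- Enable the CBO (10053) trace, then parse / explain the statement
alter session set events '10053 trace name context forever, level 1';

-- ... run EXPLAIN PLAN for the query under investigation ...

-- Switch the trace off again
alter session set events '10053 trace name context off';
```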

So Dynamic Sampling was performed, but the result was rejected because the sample size was just 31 blocks and no matching rows were found in the sample.

Another try with a higher level that results in a greater sample size, session level 8 in this case:
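Again the statements aren’t shown; presumably just the session level was raised and the experiment repeated:

```sql
-- Raise the session-level Dynamic Sampling level and repeat the
-- EXPLAIN PLAN / DBMS_XPLAN.DISPLAY combination for the same query
alter session set optimizer_dynamic_sampling = 8;
```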

We sampled almost four percent of the original table (1023 blocks), but still no luck.

Note that your results may vary depending on which random blocks get sampled – some levels below 10 might produce usable results, but most won’t.

Clearly this doesn’t look good: even higher levels of Dynamic Sampling don’t generate a reasonable estimate, and other options like Extended Statistics won’t help either.

Indexes For Help

But wait, I’ve already mentioned in the previous parts of the series that sometimes Dynamic Sampling can take advantage of indexes. So let’s create such an index:
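The index definition isn’t shown; the trace snippet further down mentions an index called PARCELS_IDX, so it might have looked like this (the column list is my assumption, covering both correlated columns):

```sql
-- Assumed definition: an index on the correlated columns; the name
-- PARCELS_IDX appears in the optimizer trace, the column list is assumed
create index parcels_idx on parcels (delivery_status, delivery_date);
```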

Since the index should allow a very precise cardinality estimate for rare values (fewer than 2,500 occurrences – more on that later) even with a small sample size, let’s repeat the experiment using level 4:

Trace file snippet:

Unfortunately this doesn’t look much better: we can tell from the line “filtered sample card. (index PARCELS_IDX): 0” that the code recognized the index, but somehow it didn’t really use it.

In the next part of this series we’ll see whether the index can be of any help and what further options exist, if any.

Go to Dynamic Sampling (II) – Controlling The Activity