In the previous article about the Oracle 12c TOP-Frequency histogram, we saw that Oracle carries out a preliminary determination step to see how dominant the TOP-N values are with regard to the total number of rows in the data set. I demonstrated that a TOP-Frequency histogram is possible provided a threshold condition is satisfied. This second article explores the Hybrid histogram type, which occurs when the data set fails to qualify for a TOP-Frequency. Firstly, it shows that the method used by Oracle to compute a Hybrid histogram is strongly related to the TOP-Frequency preliminary determination step. It then shows that a column value can be **popular**, **non-popular with an endpoint value** or **non-popular without an endpoint value**. Finally, it examines the formulas used by the optimizer to compute its cardinality estimate for each of the above three cases.

## Natural or Adjusted?

In order to investigate the method used by Oracle to calculate a Hybrid histogram, I am going to use a data set example taken from Jonathan Lewis's blog, as shown below:

```
SQL> create table t1 (id number, n1 number);
SQL> start InsT1.sql
```

(see the downloadable script at the end)

In the following, I am going to gather a Hybrid histogram in 12c (12.1.0.2.0) by specifying fewer buckets (20) than there are distinct values (37) in the current data set, using default sampling:

```
SQL> exec dbms_stats.set_global_prefs ('TRACE', to_char (1+16));
SQL> BEGIN
dbms_stats.gather_table_stats
(user, 't1', method_opt => 'for columns n1 size 20');
END;
/
SQL> exec dbms_stats.set_global_prefs('TRACE', null);
```

As in the previous article, the above call to dbms_stats package has been traced so that we can explore its content in the following snippet:

```
DBMS_STATS: NNV  NDV  AVG  MMX  HST  EP   RP   NNNP IND  CNDV HSTN HSTR COLNAME
DBMS_STATS:  Y    Y    Y    Y    Y    Y    Y    Y    Y                  N1
DBMS_STATS: Approximate NDV Options
DBMS_STATS: ACL,TOPN,NIL,NIL,RWID,U,U20U
DBMS_STATS: start processing top n values for column "N1"
DBMS_STATS:   >> frequency histograms is not feasible (topn_values is null), skip!
DBMS_STATS: Iteration: 1 numhist: 1
DBMS_STATS: NNV  NDV  AVG  MMX  HST  EP   RP   NNNP IND  CNDV HSTN HSTR COLNAME
DBMS_STATS:  Y    Y    Y    Y    Y                                      N1
DBMS_STATS: Building Histogram for N1
DBMS_STATS:  bktnum=20, nnv=100, snnv=100, sndv=37, est_ndv=37, mnb=20
DBMS_STATS:  Trying hybrid histogram
```

As expected, Oracle started by carrying out the usual TOP-Frequency preliminary determination step. It is worth emphasizing that this data set fails to immediately qualify for a TOP-Frequency histogram because the TOP-20 values account for 80 rows while the threshold is 95 rows, as shown below:

```
SQL> select sum(cnt) TopNRows
     from (select n1, count(*) cnt
           from t1
           group by n1
           order by count(*) desc)
     where rownum <= 20;

  TOPNROWS
----------
        80

SQL> select round((20-1)/20 * 100) threshold from dual;

 THRESHOLD
----------
        95
```
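The preliminary check can be replayed in a few lines of Python (a sketch, using the figures from the queries above, not Oracle's actual code):

```python
# TOP-Frequency preliminary check: the TOP-N values must cover at
# least round((num_buckets - 1)/num_buckets * num_rows) rows.
num_rows, num_buckets = 100, 20      # figures from the current data set
top_n_rows = 80                      # rows covered by the TOP-20 values

threshold = round((num_buckets - 1) / num_buckets * num_rows)
print(threshold)                     # 95
print(top_n_rows >= threshold)       # False -> TOP-Frequency rejected
```

Since 80 falls short of the 95-row threshold, Oracle rejects the TOP-Frequency path and tries a Hybrid histogram instead.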

This is what I tend to qualify as a **"Natural Hybrid"** histogram: it represents a data set which doesn't qualify for a TOP-Frequency because the threshold is greater than the naturally non-adjusted `TopNRows`, as explained in the previous article. But what happens for a data set that initially satisfies the TOP-Frequency threshold, yet fails to qualify for a TOP-Frequency histogram because Oracle finds, in the middle of the process, that the `Adjusted TopNRows` has dropped below the threshold? A dbms_stats trace file of such a data set shows different content:

```
DBMS_STATS: start processing top n values for column "N1"
DBMS_STATS:   >> frequency histograms is not feasible (topn_values is null), skip!
DBMS_STATS: start processing top n values for column "N2"
DBMS_STATS: topn sql (len: 589):
DBMS_STATS: +++ select /*+ no_parallel(t) no_parallel_index(t) dbms_stats
                cursor_sharing_exact use_weak_name_resl dynamic_sampling(0)
                no_monitoring xmlindex_sel_idx_tbl no_substrb_pad */
                substrb(dump("N2",16,0,64),1,240) val
                ,rowidtochar(rowid) rwid
            from "XXXX"."T1" t
            where rowid in (chartorowid('AAAJmWAAEAACN1zAAA'),chartorowid('AAAJmWAAEAACN1zAAB')
                 ,chartorowid('AAAJmWAAEAACN1zAAC'),chartorowid('AAAJmWAAEAACN1zAAF')
                 ,chartorowid('AAAJmWAAEAACN1zAAG'),chartorowid('AAAJmWAAEAACN1zAAH')
                 ,chartorowid('AAAJmWAAEAACN1zAAJ'),chartorowid('AAAJmWAAEAACN1zAAU'))
            order by "N2"
DBMS_STATS: remove last bucket: Typ=2 Len=2: c1,2b add: Typ=2 Len=3: c2,2,1a
DBMS_STATS: removal_count: 1 total_nonnull_rows: 200 mnb: 8
DBMS_STATS: Abort top-n histogram, as the addition of min/max does not
            preserve the minimum coverage: .85225054 vs. .91
DBMS_STATS: Building Histogram for N2
DBMS_STATS:  Trying hybrid histogram
```

While in the first case Oracle immediately abandoned the TOP-Frequency path, in the second case it went a little deeper before aborting the top-frequency process in favor of a Hybrid histogram. This is what I tend to qualify as the **"Adjusted Hybrid"** histogram: it represents a data set that fails to qualify for a TOP-Frequency because Oracle finds, in the middle of the process, that the threshold is greater than the `Adjusted TopNRows`.

The method used by Oracle to compute a Hybrid histogram depends on whether the data set is of the *"Natural Hybrid"* type or the *"Adjusted Hybrid"* one. For the latter type, Oracle takes advantage of the pre-processing it has already accomplished and transforms the aborted top-frequency into a hybrid histogram. For the former type, Oracle uses a bucket-frequency strategy.

The data set used in this article qualifies for a *“Natural Hybrid”* histogram. This is why the next section explains the method used by Oracle to collect this kind of hybrid histogram. The explanation of the “*Adjusted Hybrid*” method will be left to a separate article.

## The method

Oracle has computed a Hybrid histogram for the current data set using a bucket size of 5 (sample_size/num_buckets = 100/20), as shown below:

```
SQL> select column_name
           ,num_distinct
           ,num_buckets
           ,sample_size
           ,histogram
     from user_tab_col_statistics
     where table_name = 'T1'
     and column_name  = 'N1';

COLUMN_NAME  NUM_DISTINCT NUM_BUCKETS SAMPLE_SIZE HISTOGRAM
------------ ------------ ----------- ----------- -----------
N1                     37          20         100 HYBRID
```

At this stage of the article, let's label this bucket size the **"original"** bucket size. The method used by Oracle to compute this kind of histogram seems to obey the following strategy:

- a distinct value should not be found in more than one bucket
- a bucket size is allowed to be extended in order to contain all instances of the same distinct value

Since the bucket size can be extended so that no distinct value spans multiple buckets, deciphering the **new** size of each extended bucket represents a crucial step in figuring out the method used by Oracle to compute the Hybrid histogram. Below is the histogram information stored in the data dictionary:

```
SQL> SELECT
        (row_number() over(order by ept_nbr)-1) NumBucket
        ,ept_nbr
        ,ept_act_val
        ,rpt_cnt
        ,ept_nbr - (lag(ept_nbr,1,0) over(order by ept_nbr)) "new bucket size"
        ,bucket_size "original bucket_size"
     FROM
       (SELECT
           ah.endpoint_number ept_nbr
           ,ah.endpoint_actual_value ept_act_val
           ,lag(ah.endpoint_number,1,0) over(order by ah.endpoint_number) ept_lag
           ,ah.endpoint_repeat_count rpt_cnt
           ,at.sample_size/at.num_buckets bucket_size
        FROM
           user_tab_histograms ah
           ,user_tab_col_statistics at
        WHERE ah.table_name = at.table_name
        AND ah.column_name  = at.column_name
        AND ah.table_name   = 'T1'
        AND ah.column_name  = 'N1'
       )
     ORDER BY ept_nbr;

 NUMBUCKET    EPT_NBR EPT_ACT_VA    RPT_CNT new bucket size original bucket_size
---------- ---------- ---------- ---------- --------------- --------------------
         0          1 8                   1               1                    5
         1          6 13                  3               5                    5
         2         12 18                  2               6                    5
         3         20 20                  5               8                    5
         4         26 23                  2               6                    5
         5         32 26                  3               6                    5
         6         38 27                  6               6                    5
         7         44 28                  6               6                    5
         8         50 29                  6               6                    5
         9         58 31                  5               8                    5
        10         69 33                  8              11                    5
        11         79 35                  7              10                    5 ***
        12         86 38                  5               7                    5
        13         90 41                  1               4                    5
        14         92 42                  2               2                    5
        15         95 43                  3               3                    5
        16         96 44                  1               1                    5
        17         97 45                  1               1                    5
        18         98 46                  1               1                    5
        19        100 59                  1               2                    5

20 rows selected.
```

Excluding the boundaries, and up to the 13th endpoint number (86), all **new** bucket sizes are **greater than or equal** to the *original* bucket size. In order to have a clue about what's happening behind the scenes, let's draw an ordered picture of the data set divided into 20 buckets, each of size 5, as shown below:

```
8  12 12 13 13|13 15 16 16 17|18 18 19 19 19|20 20 20 20 20|21 22 22 22 23|
23 24 24 25 26|26 26 27 27 27|27 27 27 28 28|28 28 28 28 29|29 29 29 29 29|
30 30 30 31 31|31 31 31 32 32|32 33 33 33 33|33 33 33 33 34|34 34 35 35 35|
35 35 35 35 36|37 38 38 38 38|38 39 39 40 41|42 42 43 43 43|44 45 46 50 59|
```

Oracle starts walking the above ordered data set until it finds that value **13** traverses two buckets. Based on the first rule above, Oracle extends bucket n°1 by moving all instances of **13** from bucket n°2 into it, giving us these intermediary results:

`8 12 12 13 13 13| 15 16 16 17 |18 18 19 19 19|`

Note that 13 is not only the endpoint value of bucket n°1; since it is now located entirely within a single bucket, Oracle is also able to record its frequency of 3 (endpoint_repeat_count), as shown in the histogram information reproduced below:

```
 NUMBUCKET    EPT_NBR EPT_ACT_VA    RPT_CNT new bucket size original bucket_size
---------- ---------- ---------- ---------- --------------- --------------------
         1          6 13                  3               5                    5
         2         12 18                  2               6                    5
```

However, the above histogram information is showing that bucket n°2 is of size 6 and has 18 as its endpoint number, with a frequency of 2. This is not in accordance with this picture:

`8 12 12 13 13 13| 15 16 16 17 |18 18 19 19 19|`

Looking at this new situation, we can spot that when jumping from bucket n°2 to bucket n°3 there are no distinct values crossing multiple buckets. So we could keep bucket n°2 with its new size of 4. This, however, is not the case. In fact, it appears that there is a third rule used by Oracle when generating a *"Natural Hybrid"* histogram:

- do not create buckets with a size less than the original size (5) –not applicable at the end of the data set–

Since the above rule applies, and bucket n°2 is of size 4, it has to be extended to reach a size of 5. This is why the first instance of 18 in bucket n°3 has to be moved into bucket n°2, as in the following:

`8 12 12 13 13 13| 15 16 16 17 18| 18 19 19 19|`

However, since there are two instances of 18 in bucket n°3 they have to move together to obey the first rule in the strategy: no distinct values spanning multiple buckets. This gives us the following situation:

`8 12 12 13 13 13| 15 16 16 17 18 18| 19 19 19|`

As a result of this, we see that we do indeed have a bucket n°2 of size 6, with 18 as its endpoint value and a frequency of 2. We will see later in this article that it is this rule 3 that may introduce instability into the Hybrid histogram.

Oracle will continue traversing the rest of the data set, applying the three-point strategy above, until it finds that value **23** spans multiple buckets:

`8 12 12 13 13 13| 15 16 16 17 18 18| 19 19 19 20 20 20 20 20|-|21 22 22 22 23|23 24 24 25 26|`

Applying the same rule again, all instances of value **23** will be moved from bucket n°6 to bucket n°5, resulting in the following situation:

`8 12 12 13 13 13| 15 16 16 17 18 18| 19 19 19 20 20 20 20 20|-|21 22 22 22 23 23| 24 24 25 26|`

And so on, until it reaches the end of the data set, producing the following final histogram:

```
8 12 12 13 13 13|15 16 16 17 18 18|19 19 19 20 20 20 20 20|-|21 22 22 22 23 23|
24 24 25 26 26 26|27 27 27 27 27 27|28 28 28 28 28 28|29 29 29 29 29 29|-|
30 30 30 31 31 31 31 31|32 32 32 33 33 33 33 33 33 33 33|-|
34 34 34 35 35 35 35 35 35 35|-|36 37 38 38 38 38 38|-|
39 39 40 41|42 42|43 43 43|44|45|46|50 59|
```

You will notice that, starting from the thirteenth bucket at value 39, the bucket extension rule and the minimum size requirement seem to be ignored. At this stage of the data set, Oracle is left with 14 distinct values to spread over 8 remaining buckets. This paves the way to what seems to be the fourth rule applied by Oracle when gathering a *Natural Hybrid* histogram:

- the original number of buckets (20) should not be reduced

Instead of reducing the original number of buckets (20), Oracle seems to divide the number of remaining distinct values (14) by the number of remaining buckets (8) and spread them accordingly. Unfortunately, I didn't succeed in working out the exact strategy used by Oracle in this last step, beyond the observation that it keeps the initial number of buckets unchanged.
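To make the first three rules concrete, here is a toy Python model of them. This is emphatically not Oracle's code: the tentative cut is always the original bucket size away from the bucket start (which satisfies rule 3 implicitly), and the end-of-data-set redistribution (rule 4) is deliberately ignored, so the last buckets differ from the dictionary output shown above:

```python
# Frequencies of the 37 distinct values in the 100-row data set above.
freq = {8: 1, 12: 2, 13: 3, 15: 1, 16: 2, 17: 1, 18: 2, 19: 3, 20: 5,
        21: 1, 22: 3, 23: 2, 24: 2, 25: 1, 26: 3, 27: 6, 28: 6, 29: 6,
        30: 3, 31: 5, 32: 3, 33: 8, 34: 3, 35: 7, 36: 1, 37: 1, 38: 5,
        39: 2, 40: 1, 41: 1, 42: 2, 43: 3, 44: 1, 45: 1, 46: 1, 50: 1, 59: 1}
values = sorted(v for v, n in freq.items() for _ in range(n))

def natural_hybrid_buckets(vals, num_buckets):
    """Toy model of rules 1-3: cut every `size` rows, then extend the
    cut so that all instances of the endpoint value stay together."""
    size = len(vals) // num_buckets          # original bucket size (5 here)
    buckets, i = [], 0
    while i < len(vals):
        j = min(i + size, len(vals))         # tentative boundary (rule 3)
        while j < len(vals) and vals[j] == vals[j - 1]:
            j += 1                           # rules 1 and 2: absorb the run
        buckets.append(vals[i:j])
        i = j
    return buckets

buckets = natural_hybrid_buckets(values, 20)
# Endpoint value and repeat count of the first three buckets:
for b in buckets[:3]:
    print(b[-1], b.count(b[-1]))             # 13 3 / 18 2 / 20 5
```

The first buckets reproduce the endpoint values and repeat counts (13/3, 18/2, 20/5) recorded in the data dictionary; only the tail of the data set, where rule 4 takes over, diverges.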

## Popular or non-popular?

Oracle developers and tuning specialists are interested in having optimal execution plans. When the Optimizer is supplied with accurate statistics it will mostly come up with an optimal execution plan. However, when the optimizer uses a histogram to get its cardinality estimate, the arithmetic it employs for a popular value differs from that used for a non-popular value. This is why identifying the popularity of a column value represents a crucial step in the tuning process. This section of the article shows when a column with a hybrid histogram is considered popular or not and shows the arithmetic used by Oracle to compute the cardinality estimate for each case.

When it comes to a hybrid histogram Oracle considers a value to be popular when its recorded endpoint_repeat_count value exceeds the *original* bucket size. The endpoint_repeat_count is a new 12c histogram column property in which Oracle stores the number of times a distinct value has been repeated in a bucket. Let’s put this in a SQL format:

```
SQL> select
        uth.endpoint_number
        ,uth.endpoint_actual_value
        ,uth.endpoint_repeat_count
        ,ucs.sample_size/ucs.num_buckets bucket_size
        ,(uth.endpoint_repeat_count - ucs.sample_size/ucs.num_buckets) Popularity
     from
        user_tab_histograms uth
        ,user_tab_col_statistics ucs
     where uth.table_name = ucs.table_name
     and uth.column_name  = ucs.column_name
     and uth.table_name   = 'T1'
     and uth.column_name  = 'N1'
     order by uth.endpoint_number;

ENDPOINT_NUMBER ENDPOINT_A ENDPOINT_REPEAT_COUNT BUCKET_SIZE POPULARITY
--------------- ---------- --------------------- ----------- ----------
              1 8                              1           5         -4
              6 13                             3           5         -2
             12 18                             2           5         -3
             20 20                             5           5          0
             26 23                             2           5         -3
             32 26                             3           5         -2
             38 27                             6           5          1
             44 28                             6           5          1
             50 29                             6           5          1
             58 31                             5           5          0
             69 33                             8           5          3
             79 35                             7           5          2
             86 38                             5           5          0
             90 41                             1           5         -4
             92 42                             2           5         -3
             95 43                             3           5         -2
             96 44                             1           5         -4
             97 45                             1           5         -4
             98 46                             1           5         -4
            100 59                             1           5         -4

20 rows selected.
```

The **Popularity** column above represents the difference between the frequency of appearance of each column value and the bucket size (sample_size/num_buckets). It can easily be inferred from this column that a value is considered popular when its **Popularity** value is greater than zero. Refactoring the above query a little gives this:

```
SQL> select
        endpoint_number
        ,endpoint_actual_value
        ,endpoint_repeat_count
        ,bucket_size
        ,case when Popularity > 0 then 'Pop'
              else 'Non-Pop'
         end Popularity
     from
       (select
           uth.endpoint_number
           ,uth.endpoint_actual_value
           ,uth.endpoint_repeat_count
           ,ucs.sample_size/ucs.num_buckets bucket_size
           ,(uth.endpoint_repeat_count - ucs.sample_size/ucs.num_buckets) Popularity
        from
           user_tab_histograms uth
           ,user_tab_col_statistics ucs
        where uth.table_name = ucs.table_name
        and uth.column_name  = ucs.column_name
        and uth.table_name   = 'T1'
        and uth.column_name  = 'N1'
       )
     order by endpoint_number;

ENDPOINT_NUMBER ENDPOINT_A ENDPOINT_REPEAT_COUNT BUCKET_SIZE POPULAR
--------------- ---------- --------------------- ----------- -------
              1 8                              1           5 Non-Pop
              6 13                             3           5 Non-Pop
             12 18                             2           5 Non-Pop
             20 20                             5           5 Non-Pop
             26 23                             2           5 Non-Pop
             32 26                             3           5 Non-Pop
             38 27                             6           5 Pop
             44 28                             6           5 Pop
             50 29                             6           5 Pop
             58 31                             5           5 Non-Pop
             69 33                             8           5 Pop
             79 35                             7           5 Pop
             86 38                             5           5 Non-Pop
             90 41                             1           5 Non-Pop
             92 42                             2           5 Non-Pop
             95 43                             3           5 Non-Pop
             96 44                             1           5 Non-Pop
             97 45                             1           5 Non-Pop
             98 46                             1           5 Non-Pop
            100 59                             1           5 Non-Pop

20 rows selected.
```

Simply put, a captured Hybrid histogram value is considered to be:

- popular, when its endpoint_repeat_count is greater than the **original** bucket size
- non-popular, with a recorded endpoint number in the histogram
- non-popular, and absent from the histogram information

Thus, in contrast to its predecessor, the Height Balanced histogram, it is not sufficient for a Hybrid value to be captured in the histogram with an endpoint_repeat_count greater than 1 for it to be considered popular.
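The decision rule above can be condensed into a tiny Python helper (a sketch, using the figures from the query output above):

```python
def classify(endpoint_repeat_count, sample_size=100, num_buckets=20):
    """A value captured in a Hybrid histogram is popular only when its
    repeat count strictly exceeds the original bucket size."""
    bucket_size = sample_size / num_buckets        # 5 for our data set
    return 'Pop' if endpoint_repeat_count > bucket_size else 'Non-Pop'

print(classify(8))   # value 33 -> Pop
print(classify(5))   # value 20 -> Non-Pop (equal is not enough)
print(classify(2))   # value 18 -> Non-Pop, despite a repeat count > 1
```

Note the strict inequality: value 20, which fills exactly one original bucket (repeat count 5), is still non-popular.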

## Cardinality estimation

As stated above, the popularity of a Hybrid histogram value dictates the arithmetic used by Oracle to calculate its cardinality estimate. The preceding section illustrated three types of Hybrid histogram values:

- popular value
- non-popular value with an endpoint number
- non-popular value not present in the histogram table

Let’s start by figuring out how Oracle estimates the cardinality of the first type:

### Popular value

Below is a list of popular values in the current data set:

```
SQL> select
endpoint_actual_value
,endpoint_repeat_count
,bucket_size
,Popularity
from
(
select
endpoint_number
,endpoint_actual_value
,endpoint_repeat_count
,bucket_size
,case when Popularity > 0 then 'Pop'
else 'Non-Pop'
end Popularity
from
(
select
uth.endpoint_number
,uth.endpoint_actual_value
,uth.endpoint_repeat_count
,ucs.sample_size/ucs.num_buckets bucket_size
,(uth.endpoint_repeat_count - ucs.sample_size/ucs.num_buckets) Popularity
from
user_tab_histograms uth
,user_tab_col_statistics ucs
where
uth.table_name = ucs.table_name
and uth.column_name = ucs.column_name
and uth.table_name = 'T1'
and uth.column_name = 'N1'
)
)
where Popularity = 'Pop';
ENDPOINT_A ENDPOINT_REPEAT_COUNT BUCKET_SIZE POPULAR
---------- --------------------- ----------- -------
27 6 5 Pop
28 6 5 Pop
29 6 5 Pop
33 8 5 Pop
35 7 5 Pop
5 rows selected.
```

Let’s get the Oracle optimizer cardinality estimation using one of the above values:

```
SQL> explain plan for select count(1) from t1 where n1 = 33;

SQL> select * from table(dbms_xplan.display);

-------------------------------------------
| Id  | Operation          | Name | Rows  |
-------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |
|   1 |  SORT AGGREGATE    |      |     1 |
|*  2 |   TABLE ACCESS FULL| T1   |     8 |
-------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("N1"=33)
```

In order to get the above cardinality estimation, Oracle uses the following formula:

```
E-Rows (12c) = ENDPOINT_REPEAT_COUNT * num_rows/sample_size
E-Rows (12c) = 8 * 100/100 = 8
```

### Non-Popular value with an endpoint number

For the data set considered in this article, the following values are non-popular values having an endpoint number:

```
SQL> select
        endpoint_actual_value
        ,endpoint_repeat_count
        ,bucket_size
        ,Popularity
     from
       (select
           endpoint_number
           ,endpoint_actual_value
           ,endpoint_repeat_count
           ,bucket_size
           ,case when Popularity > 0 then 'Pop'
                 else 'Non-Pop'
            end Popularity
        from
          (select
              uth.endpoint_number
              ,uth.endpoint_actual_value
              ,uth.endpoint_repeat_count
              ,ucs.sample_size/ucs.num_buckets bucket_size
              ,(uth.endpoint_repeat_count - ucs.sample_size/ucs.num_buckets) Popularity
           from
              user_tab_histograms uth
              ,user_tab_col_statistics ucs
           where uth.table_name = ucs.table_name
           and uth.column_name  = ucs.column_name
           and uth.table_name   = 'T1'
           and uth.column_name  = 'N1'
          )
       )
     where Popularity = 'Non-Pop';

ENDPOINT_A ENDPOINT_REPEAT_COUNT BUCKET_SIZE POPULAR
---------- --------------------- ----------- -------
8                              1           5 Non-Pop
13                             3           5 Non-Pop
18                             2           5 Non-Pop
20                             5           5 Non-Pop
23                             2           5 Non-Pop
26                             3           5 Non-Pop
31                             5           5 Non-Pop
38                             5           5 Non-Pop
41                             1           5 Non-Pop
42                             2           5 Non-Pop
43                             3           5 Non-Pop
44                             1           5 Non-Pop
45                             1           5 Non-Pop
46                             1           5 Non-Pop
59                             1           5 Non-Pop

15 rows selected.
```

Let’s then get the Oracle optimizer cardinality estimation of one of the above non-popular values:

```
SQL> explain plan for select count(1) from t1 where n1 = 45;

SQL> select * from table(dbms_xplan.display);

--------------------------------------------
| Id  | Operation          | Name | Rows  |
--------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |
|   1 |  SORT AGGREGATE    |      |     1 |
|*  2 |   TABLE ACCESS FULL| T1   |     2 |
--------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("N1"=45)
```

In contrast to what you might expect, the optimizer calculates the cardinality estimate for a non-popular value having an endpoint number using the following different formula:

```
E-Rows (12c) = num_rows * greatest(NewDensity, ENDPOINT_REPEAT_COUNT/sample_size)
```

The optimizer calculates the *NewDensity* using an internal algorithm based on factors like the number of distinct values, the number of popular buckets, the number of popular values, and so on. Nevertheless, using the following select (inspired by Alberto Dell'Era) we can get a reliable value for the NewDensity:

```
SQL> SELECT
        BktCnt
        ,PopBktCnt
        ,PopValCnt
        ,NDV
        ,pop_bucketSize
        ,trunc(((BktCnt-PopBktCnt)/BktCnt)/(NDV-PopValCnt),10) NewDensity
     FROM
       (SELECT
           COUNT(1) PopValCnt,
           SUM(endpoint_repeat_count) PopBktCnt,
           ndv, BktCnt, pop_bucketSize
        FROM
          (SELECT
              (sample_size - num_nulls) BktCnt,
              num_distinct ndv,
              num_buckets,
              density OldDensity,
              (sample_size-num_nulls)/num_buckets pop_bucketSize
           FROM user_tab_col_statistics
           WHERE table_name = 'T1'
           AND column_name  = 'N1'
          ), user_histograms
        WHERE table_name = 'T1'
        AND column_name  = 'N1'
        AND endpoint_repeat_count > pop_bucketSize
        GROUP BY ndv, BktCnt, pop_bucketSize
       );

    BKTCNT  POPBKTCNT  POPVALCNT        NDV POP_BUCKETSIZE NEWDENSITY
---------- ---------- ---------- ---------- -------------- ----------
       100         33          5         37              5   .0209375
```

Using this computed *NewDensity* in the above cardinality estimation formula gives this:

```
E-Rows (12c) = num_rows * greatest(NewDensity, ENDPOINT_REPEAT_COUNT/sample_size)
E-Rows (12c) = 100 * greatest(.0209375, 1/100) = 2.09375 ~ 2
```
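The same arithmetic can be replayed in Python (a sketch using the figures from the query output above; the 10-decimal truncation applied by the SQL is immaterial here):

```python
# Figures selected above for column N1 of table T1.
bkt_cnt, pop_bkt_cnt = 100, 33     # sampled rows, rows in popular buckets
ndv, pop_val_cnt = 37, 5           # distinct values, popular values
num_rows, sample_size = 100, 100

# NewDensity = (fraction of rows in non-popular buckets) / (non-popular NDV)
new_density = ((bkt_cnt - pop_bkt_cnt) / bkt_cnt) / (ndv - pop_val_cnt)
print(new_density)                 # 0.0209375

# Non-popular value 45 has endpoint_repeat_count = 1.
e_rows = num_rows * max(new_density, 1 / sample_size)
print(e_rows)                      # 2.09375, rounded by the optimizer to 2
```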

The corresponding 10053 CBO trace backs up the above arithmetic, as shown in the following:

```
=====================================
Access path analysis for T1
***************************************
SINGLE TABLE ACCESS PATH
  Single Table Cardinality Estimation for T1[T1]
  SPD: Return code in qosdDSDirSetup: NOCTX, estType = TABLE
  Column (#2):
    NewDensity: 0.020938, OldDensity: 0.026167 BktCnt: 100, PopBktCnt: 33,
    PopValCnt: 5, NDV: 37
  Column (#2): N1(NUMBER)
    AvgLen: 3 NDV: 37 Nulls: 0 Density: 0.020938 Min: 8.000000 Max: 59.000000
    Histogram: Hybrid  #Bkts: 20  UncompBkts: 100  EndPtVals: 20  ActualVal: yes
  Table: T1  Alias: T1
    Card: Original: 100.  Rounded: 2  Computed: 2.093750  Non Adjusted: 2.093750
  Best:: AccessPath: TableScan
         Cost: 3.001739  Degree: 1  Resp: 3.001739  Card: 2.093750  Bytes: 0.000000
```

Note how this trace file shows the same information (`NewDensity`, `BktCnt`, `PopBktCnt`, `PopValCnt` and `NDV`) we selected above when calculating the NewDensity. The same trace file also shows that the computed cardinality is exactly equal to the one we calculated using the supplied formula.

To give more credit to this formula, let's apply it to get the cardinality estimate of another non-popular value, one having an endpoint_repeat_count of 3 (that is, greater than 1), as shown below:

```
SQL> explain plan for select count(1) from t1 where n1 = 43;

SQL> select * from table(dbms_xplan.display);

-------------------------------------------
| Id  | Operation          | Name | Rows  |
-------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |
|   1 |  SORT AGGREGATE    |      |     1 |
|*  2 |   TABLE ACCESS FULL| T1   |     3 |
-------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("N1"=43)
```

```
E-Rows (12c) = num_rows * greatest(NewDensity, ENDPOINT_REPEAT_COUNT/sample_size)
E-Rows (12c) = 100 * greatest(.0209375, 3/100) = 3
```

### Non-Popular value without an endpoint number

In order to calculate the cardinality estimates of a hybrid column value that has not been captured into the histogram, the Optimizer uses the following formula:

`E-Rows (12c) = num_rows * NewDensity = 100 * .0209375 = 2.09375 ~ 2`

The code below illustrates this formula applied to one of the non-captured values:

```
SQL> explain plan for select count(1) from t1 where n1 = 17;

SQL> select * from table(dbms_xplan.display);

-------------------------------------------
| Id  | Operation          | Name | Rows  |
-------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |
|   1 |  SORT AGGREGATE    |      |     1 |
|*  2 |   TABLE ACCESS FULL| T1   |     2 |
-------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("N1"=17)
```

As the above formula suggests, **all** non-popular hybrid histogram values that have not been captured will **always** have the **same constant** cardinality estimation whatever their real count is in the table.
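Putting the three cases together, the cardinality arithmetic can be sketched as a single hypothetical Python function (the captured endpoints and the NewDensity below are the figures from this article's data set, and the function is an illustration, not the optimizer's actual code):

```python
def e_rows(value, captured, num_rows=100, sample_size=100,
           num_buckets=20, new_density=0.0209375):
    """Sketch of the 12c Hybrid histogram cardinality arithmetic for an
    equality predicate. `captured` maps endpoint values to their
    endpoint_repeat_count."""
    bucket_size = sample_size / num_buckets
    if value in captured:
        rpt = captured[value]
        if rpt > bucket_size:                       # popular value
            return round(rpt * num_rows / sample_size)
        # non-popular value, but with an endpoint number
        return round(num_rows * max(new_density, rpt / sample_size))
    return round(num_rows * new_density)            # not captured at all

captured = {33: 8, 43: 3, 45: 1}                    # subset of the histogram
print(e_rows(33, captured))   # 8  (popular)
print(e_rows(45, captured))   # 2  (non-popular, NewDensity wins)
print(e_rows(43, captured))   # 3  (non-popular, repeat count wins)
print(e_rows(17, captured))   # 2  (not captured: always num_rows * NewDensity)
```

The four results match the Rows column of the four execution plans shown in this section.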

## Instability

We stated above that, when applying rule 3 to ensure the minimum bucket size of 5 while building its Hybrid histogram, Oracle introduces a certain instability. With the actual data set of 100 rows and 37 distinct values, Oracle ended up making the value 17 non-popular without an endpoint, and 18 an endpoint value with an endpoint_repeat_count of 2, as shown below:

```
ENDPOINT_NUMBER ENDPOINT_A ENDPOINT_REPEAT_COUNT BUCKET_SIZE POPULAR
--------------- ---------- --------------------- ----------- -------
              1 8                              1           5 Non-Pop
              6 13                             3           5 Non-Pop
             12 18                             2           5 Non-Pop
```

If we add a new instance of 16 the reverse situation will happen:

```
SQL> insert into t1 values (2, 16);

SQL> BEGIN
       dbms_stats.gather_table_stats
         (user, 't1', method_opt => 'for all columns size 20');
     END;
     /

SQL> select
        uth.endpoint_number
        ,uth.endpoint_actual_value
        ,uth.endpoint_repeat_count
     from
        user_tab_histograms uth
        ,user_tables ut
        ,user_tab_col_statistics ucs
     where uth.table_name = 'T1'
     and uth.column_name  = 'N1'
     and uth.table_name   = ut.table_name
     and ut.table_name    = ucs.table_name
     and uth.column_name  = ucs.column_name
     order by uth.endpoint_number;

ENDPOINT_NUMBER ENDPOINT_A ENDPOINT_REPEAT_COUNT
--------------- ---------- ---------------------
              1 8                              1
              6 13                             3
             11 17                             1 ***
             16 19                             3
             21 20                             5
...
             99 46                             1
            101 59                             1

20 rows selected.
```

See how “18” has been excluded from the histogram information while “17” has been incorporated into it. This new situation is the result of the following new data set organization:

`8 12 12 13 13|13 15 16 16 16|17 18 18 19 19 19`

Which, when passed through the Hybrid histogram crunching strategy gives this:

`8 12 12 13 13 13| 15 16 16 16 17 |18 18 19 19 19`

Add an extra instance of 16 and both 17 and 18 will quit the histogram information, while 16 will be recorded with a correct frequency:

```
SQL> insert into t1 values (3, 16);

SQL> BEGIN
       dbms_stats.gather_table_stats
         (user, 't1', method_opt => 'for all columns size 20');
     END;
     /

ENDPOINT_NUMBER ENDPOINT_A ENDPOINT_REPEAT_COUNT
--------------- ---------- ---------------------
              1 8                              1
              6 13                             3
             11 16                             4 ***
             17 19                             3
             22 20                             5
...
            100 46                             1
            102 59                             1

20 rows selected.
```

## Conclusion

Frequency histograms, provided you've collected them at the right moment using a representative sample, can help the optimizer get correct cardinality estimates. However, their internal upper bound of 254 distinct values drastically limits their usefulness in favor of the imprecise Height Balanced histogram, which requires, for a distinct value to be properly captured, that one of its instances appears as the endpoint of at least two distinct buckets. Oracle 12cR1 has not only extended the internal limit of the frequency histogram to 2048 distinct values, but has also implemented two new types of histogram: TOP-Frequency and Hybrid. The latter combines the best attributes of its two predecessors: the accuracy of the frequency histogram, by counting the number of times a distinct value repeats itself; and the bucket-saving quality of the Height Balanced histogram, by collapsing all instances of the same distinct value into a single bucket without increasing the original number of buckets. The Hybrid histogram captures more popular values, and therefore gives the Oracle optimizer a greater chance of coming up with an optimal execution plan.

Code for the examples: InsT1

## 1 Comment

Nenad Noveljic

22/11/2016

Hello Mohamed,

Thank you for sharing this information. Your blog post helped me a lot at troubleshooting wrong cardinality estimations with hybrid histograms.

However, I noticed that the queries don't return accurate results when applied to data distributions with strong peaks. In my opinion, the new density formula has changed with the introduction of hybrid histograms. What I mean by that is the expression (BktCnt-PopBktCnt)/BktCnt can't be used any more for calculating the fraction of unpopular values, because the buckets are of uneven sizes. Therefore, I described the alternative method for calculating the density [here](http://nenadnoveljic.com/blog/density-calculation-hybrid-histograms/).

Kind regards,

Nenad
