In the previous article about the Oracle 12c TOP-Frequency histogram, we saw that Oracle carries out a preliminary determination step to see how dominant the TOP-N values are with regard to the total number of rows in the data set. I demonstrated that a TOP-Frequency histogram is possible provided a threshold condition is satisfied. This second article explores the Hybrid histogram type, which occurs when the data set fails to qualify for a TOP-Frequency. Firstly, it shows that the method used by Oracle to compute a Hybrid histogram is strongly related to the TOP-Frequency preliminary determination step. It then shows that a column value can be **popular**, **non-popular with an endpoint value** or **non-popular without an endpoint value**. Finally, it examines the formulas used by the optimizer to compute its cardinality estimate for each of the above three cases.

## Natural or Adjusted?

In order to investigate the method used by Oracle to calculate a Hybrid histogram, I am going to use a data set example taken from Jonathan Lewis's blog, as shown below:

```
SQL> create table t1 (id number, n1 number);
SQL> start InsT1.sql
```

(see the downloadable script at the end)

In the following, I am going to gather a Hybrid histogram in 12c (12.1.0.2.0) by specifying fewer buckets (20) than there are distinct values (37) in the current data set, using default sampling:

```
SQL> exec dbms_stats.set_global_prefs ('TRACE', to_char (1+16));
SQL> BEGIN
dbms_stats.gather_table_stats
(user, 't1', method_opt => 'for columns n1 size 20');
END;
/
SQL> exec dbms_stats.set_global_prefs('TRACE', null);
```

As in the previous article, the above call to dbms_stats package has been traced so that we can explore its content in the following snippet:

```
DBMS_STATS: NNV  NDV  AVG  MMX  HST  EP   RP   NNNP IND  CNDV HSTN HSTR COLNAME
DBMS_STATS:  Y    Y    Y    Y    Y    Y    Y    Y    Y                  N1
DBMS_STATS: Approximate NDV Options
DBMS_STATS: ACL,TOPN,NIL,NIL,RWID,U,U20U
DBMS_STATS: start processing top n values for column "N1"
DBMS_STATS:   >> frequency histograms is not feasible (topn_values is null), skip!
DBMS_STATS: Iteration: 1 numhist: 1
DBMS_STATS: NNV  NDV  AVG  MMX  HST  EP   RP   NNNP IND  CNDV HSTN HSTR COLNAME
DBMS_STATS:  Y    Y    Y    Y    Y                                      N1
DBMS_STATS: Building Histogram for N1
DBMS_STATS:  bktnum=20, nnv=100, snnv=100, sndv=37, est_ndv=37, mnb=20
DBMS_STATS:  Trying hybrid histogram
```

As expected, Oracle started by carrying out the usual TOP-Frequency preliminary determination step. It is worth emphasizing that this data set fails to immediately qualify for a TOP-Frequency histogram because the TOP-20 values account for 80 rows while the threshold is 95 rows, as shown below:

```
SQL> select sum(cnt) TopNRows
     from (select n1, count(*) cnt
           from t1
           group by n1
           order by count(*) desc)
     where rownum <= 20;

  TOPNROWS
----------
        80

SQL> select round((20-1)/20 * 100) threshold from dual;

 THRESHOLD
----------
        95
```
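The preliminary check can be replayed in a few lines of Python (a sketch, using the figures from the queries above, not Oracle's actual code):

```python
# TOP-Frequency preliminary check: the TOP-N values must cover at
# least round((num_buckets - 1)/num_buckets * num_rows) rows.
num_rows, num_buckets = 100, 20      # figures from the current data set
top_n_rows = 80                      # rows covered by the TOP-20 values

threshold = round((num_buckets - 1) / num_buckets * num_rows)
print(threshold)                     # 95
print(top_n_rows >= threshold)       # False -> TOP-Frequency rejected
```

Since 80 falls short of the 95-row threshold, Oracle rejects the TOP-Frequency path and tries a Hybrid histogram instead.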

This is what I tend to qualify as a **"Natural Hybrid"** histogram: it represents a data set which doesn't qualify for a TOP-Frequency because the threshold is greater than the naturally non-adjusted `TopNRows`, as explained in the previous article. But what happens for a data set that initially satisfies the TOP-Frequency threshold, yet fails to qualify for a TOP-Frequency histogram because Oracle finds, in the middle of the process, that the `Adjusted TopNRows` has dropped below the threshold? A dbms_stats trace file of such a data set shows different content:

```
DBMS_STATS: start processing top n values for column "N1"
DBMS_STATS:   >> frequency histograms is not feasible (topn_values is null), skip!
DBMS_STATS: start processing top n values for column "N2"
DBMS_STATS: topn sql (len: 589):
DBMS_STATS: +++ select /*+ no_parallel(t) no_parallel_index(t) dbms_stats
                cursor_sharing_exact use_weak_name_resl dynamic_sampling(0)
                no_monitoring xmlindex_sel_idx_tbl no_substrb_pad */
                substrb(dump("N2",16,0,64),1,240) val
                ,rowidtochar(rowid) rwid
            from "XXXX"."T1" t
            where rowid in (chartorowid('AAAJmWAAEAACN1zAAA'),chartorowid('AAAJmWAAEAACN1zAAB')
                 ,chartorowid('AAAJmWAAEAACN1zAAC'),chartorowid('AAAJmWAAEAACN1zAAF')
                 ,chartorowid('AAAJmWAAEAACN1zAAG'),chartorowid('AAAJmWAAEAACN1zAAH')
                 ,chartorowid('AAAJmWAAEAACN1zAAJ'),chartorowid('AAAJmWAAEAACN1zAAU'))
            order by "N2"
DBMS_STATS: remove last bucket: Typ=2 Len=2: c1,2b add: Typ=2 Len=3: c2,2,1a
DBMS_STATS: removal_count: 1 total_nonnull_rows: 200 mnb: 8
DBMS_STATS: Abort top-n histogram, as the addition of min/max does not
            preserve the minimum coverage: .85225054 vs. .91
DBMS_STATS: Building Histogram for N2
DBMS_STATS:  Trying hybrid histogram
```

While in the first case Oracle immediately abandoned the TOP-Frequency path, in the second case it went a little deeper before aborting the top-frequency process in favor of a Hybrid histogram. This is what I tend to qualify as the **"Adjusted Hybrid"** histogram: it represents a data set that fails to qualify for a TOP-Frequency because Oracle finds, in the middle of the process, that the threshold is greater than the `Adjusted TopNRows`.

The method used by Oracle to compute a Hybrid histogram depends on whether the data set is of the *"Natural Hybrid"* type or the *"Adjusted Hybrid"* one. For the latter type, Oracle takes advantage of the pre-processing it has already accomplished and transforms the aborted top-frequency into a hybrid histogram. For the former type, Oracle uses a bucket-frequency strategy.

The data set used in this article qualifies for a *“Natural Hybrid”* histogram. This is why the next section explains the method used by Oracle to collect this kind of hybrid histogram. The explanation of the “*Adjusted Hybrid*” method will be left to a separate article.

## The method

Oracle has computed a Hybrid histogram for the current data set using a bucket size of 5 (sample_size/num_buckets = 100/20), as shown below:

```
SQL> select column_name
           ,num_distinct
           ,num_buckets
           ,sample_size
           ,histogram
     from user_tab_col_statistics
     where table_name = 'T1'
     and column_name  = 'N1';

COLUMN_NAME  NUM_DISTINCT NUM_BUCKETS SAMPLE_SIZE HISTOGRAM
------------ ------------ ----------- ----------- -----------
N1                     37          20         100 HYBRID
```

At this stage of the article, let's label this bucket size the **"original"** bucket size. The method used by Oracle to compute this kind of histogram seems to obey the following strategy:

- a distinct value should not be found in more than one bucket
- a bucket size is allowed to be extended in order to contain all instances of the same distinct value

Since the bucket size can be extended so that no distinct value spans multiple buckets, deciphering the **new** size of each extended bucket represents a crucial step in figuring out the method used by Oracle to compute the Hybrid histogram. Below is the histogram information stored in the data dictionary:

```
SQL> SELECT
        (row_number() over(order by ept_nbr)-1) NumBucket
        ,ept_nbr
        ,ept_act_val
        ,rpt_cnt
        ,ept_nbr - (lag(ept_nbr,1,0) over(order by ept_nbr)) "new bucket size"
        ,bucket_size "original bucket_size"
     FROM
       (SELECT
           ah.endpoint_number ept_nbr
           ,ah.endpoint_actual_value ept_act_val
           ,lag(ah.endpoint_number,1,0) over(order by ah.endpoint_number) ept_lag
           ,ah.endpoint_repeat_count rpt_cnt
           ,at.sample_size/at.num_buckets bucket_size
        FROM
           user_tab_histograms ah
           ,user_tab_col_statistics at
        WHERE ah.table_name = at.table_name
        AND ah.column_name  = at.column_name
        AND ah.table_name   = 'T1'
        AND ah.column_name  = 'N1'
       )
     ORDER BY ept_nbr;

 NUMBUCKET    EPT_NBR EPT_ACT_VA    RPT_CNT new bucket size original bucket_size
---------- ---------- ---------- ---------- --------------- --------------------
         0          1 8                   1               1                    5
         1          6 13                  3               5                    5
         2         12 18                  2               6                    5
         3         20 20                  5               8                    5
         4         26 23                  2               6                    5
         5         32 26                  3               6                    5
         6         38 27                  6               6                    5
         7         44 28                  6               6                    5
         8         50 29                  6               6                    5
         9         58 31                  5               8                    5
        10         69 33                  8              11                    5
        11         79 35                  7              10                    5 ***
        12         86 38                  5               7                    5
        13         90 41                  1               4                    5
        14         92 42                  2               2                    5
        15         95 43                  3               3                    5
        16         96 44                  1               1                    5
        17         97 45                  1               1                    5
        18         98 46                  1               1                    5
        19        100 59                  1               2                    5

20 rows selected.
```

Excluding the boundaries, and up to the 13th endpoint number (86), all **new** bucket sizes are **greater than or equal** to the *original* bucket size. In order to have a clue about what's happening behind the scenes, let's draw an ordered picture of the data set divided into 20 buckets, each of size 5, as shown below:

```
8  12 12 13 13|13 15 16 16 17|18 18 19 19 19|20 20 20 20 20|21 22 22 22 23|
23 24 24 25 26|26 26 27 27 27|27 27 27 28 28|28 28 28 28 29|29 29 29 29 29|
30 30 30 31 31|31 31 31 32 32|32 33 33 33 33|33 33 33 33 34|34 34 35 35 35|
35 35 35 35 36|37 38 38 38 38|38 39 39 40 41|42 42 43 43 43|44 45 46 50 59|
```

Oracle starts walking the above ordered data set until it finds that value **13** traverses two buckets. Based on the first rule above, Oracle extends bucket n°1 by moving all instances of **13** from bucket n°2 into it, giving us these intermediary results:

`8 12 12 13 13 13| 15 16 16 17 |18 18 19 19 19|`

Note that 13 is not only the endpoint value of bucket n°1; since it is now located entirely within a single bucket, Oracle is also able to record its frequency of 3 (endpoint_repeat_count), as shown in the histogram information reproduced below:

```
 NUMBUCKET    EPT_NBR EPT_ACT_VA    RPT_CNT new bucket size original bucket_size
---------- ---------- ---------- ---------- --------------- --------------------
         1          6 13                  3               5                    5
         2         12 18                  2               6                    5
```

However, the above histogram information is showing that bucket n°2 is of size 6 and has 18 as its endpoint number, with a frequency of 2. This is not in accordance with this picture:

`8 12 12 13 13 13| 15 16 16 17 |18 18 19 19 19|`

Looking at this new situation, we can spot that when jumping from bucket n°2 to bucket n°3 there are no distinct values crossing multiple buckets. So we could keep bucket n°2 with its new size of 4. This, however, is not the case. In fact, it appears that there is a third rule used by Oracle when generating a *"Natural Hybrid"* histogram:

- do not create buckets with a size less than the original size (5) –not applicable at the end of the data set–

Since the above rule applies, and bucket n°2 is of size 4, it has to be extended to reach a size of 5. This is why the first instance of 18 in bucket n°3 has to be moved into bucket n°2, as in the following:

`8 12 12 13 13 13| 15 16 16 17 18| 18 19 19 19|`

However, since there are two instances of 18 in bucket n°3 they have to move together to obey the first rule in the strategy: no distinct values spanning multiple buckets. This gives us the following situation:

`8 12 12 13 13 13| 15 16 16 17 18 18| 19 19 19|`

As a result of this, we see that we do indeed have a bucket n°2 of size 6, with 18 as its endpoint value and a frequency of 2. We will see later in this article that it is this rule 3 that may introduce instability into the Hybrid histogram.

Oracle will continue traversing the rest of the data set, applying the three-point strategy above, until it finds that value **23** spans multiple buckets:

`8 12 12 13 13 13| 15 16 16 17 18 18| 19 19 19 20 20 20 20 20|-|21 22 22 22 23|23 24 24 25 26|`

Applying the same rule again, all instances of value **23** will be moved from bucket n°6 to bucket n°5, resulting in the following situation:

`8 12 12 13 13 13| 15 16 16 17 18 18| 19 19 19 20 20 20 20 20|-|21 22 22 22 23 23| 24 24 25 26|`

And so on, until it reaches the end of the data set, producing the following final histogram:

```
8 12 12 13 13 13|15 16 16 17 18 18|19 19 19 20 20 20 20 20|-|21 22 22 22 23 23|
24 24 25 26 26 26|27 27 27 27 27 27|28 28 28 28 28 28|29 29 29 29 29 29|-|
30 30 30 31 31 31 31 31|32 32 32 33 33 33 33 33 33 33 33|-|
34 34 34 35 35 35 35 35 35 35|-|36 37 38 38 38 38 38|-|
39 39 40 41|42 42|43 43 43|44|45|46|50 59|
```

You will notice that, starting from the thirteenth bucket at value 39, the bucket extension rule and the minimum size requirement seem to be ignored. At this stage of the data set, Oracle is left with 14 distinct values to spread over 8 remaining buckets. This paves the way to what seems to be the fourth rule applied by Oracle when gathering a *Natural Hybrid* histogram:

- the original number of buckets (20) should not be reduced

Instead of reducing the original number of buckets (20), Oracle seems to divide the number of remaining distinct values (14) by the number of remaining buckets (8) and spread them accordingly. Unfortunately, I didn't succeed in working out the exact strategy used by Oracle in this last step, beyond the observation that it keeps the initial number of buckets unchanged.
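To make the first three rules concrete, here is a toy Python model of them. This is emphatically not Oracle's code: the tentative cut is always the original bucket size away from the bucket start (which satisfies rule 3 implicitly), and the end-of-data-set redistribution (rule 4) is deliberately ignored, so the last buckets differ from the dictionary output shown above:

```python
# Frequencies of the 37 distinct values in the 100-row data set above.
freq = {8: 1, 12: 2, 13: 3, 15: 1, 16: 2, 17: 1, 18: 2, 19: 3, 20: 5,
        21: 1, 22: 3, 23: 2, 24: 2, 25: 1, 26: 3, 27: 6, 28: 6, 29: 6,
        30: 3, 31: 5, 32: 3, 33: 8, 34: 3, 35: 7, 36: 1, 37: 1, 38: 5,
        39: 2, 40: 1, 41: 1, 42: 2, 43: 3, 44: 1, 45: 1, 46: 1, 50: 1, 59: 1}
values = sorted(v for v, n in freq.items() for _ in range(n))

def natural_hybrid_buckets(vals, num_buckets):
    """Toy model of rules 1-3: cut every `size` rows, then extend the
    cut so that all instances of the endpoint value stay together."""
    size = len(vals) // num_buckets          # original bucket size (5 here)
    buckets, i = [], 0
    while i < len(vals):
        j = min(i + size, len(vals))         # tentative boundary (rule 3)
        while j < len(vals) and vals[j] == vals[j - 1]:
            j += 1                           # rules 1 and 2: absorb the run
        buckets.append(vals[i:j])
        i = j
    return buckets

buckets = natural_hybrid_buckets(values, 20)
# Endpoint value and repeat count of the first three buckets:
for b in buckets[:3]:
    print(b[-1], b.count(b[-1]))             # 13 3 / 18 2 / 20 5
```

The first buckets reproduce the endpoint values and repeat counts (13/3, 18/2, 20/5) recorded in the data dictionary; only the tail of the data set, where rule 4 takes over, diverges.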

## Popular or non-popular?

Oracle developers and tuning specialists are interested in having optimal execution plans. When the Optimizer is supplied with accurate statistics it will mostly come up with an optimal execution plan. However, when the optimizer uses a histogram to get its cardinality estimate, the arithmetic it employs for a popular value differs from that used for a non-popular value. This is why identifying the popularity of a column value represents a crucial step in the tuning process. This section of the article shows when a column with a hybrid histogram is considered popular or not and shows the arithmetic used by Oracle to compute the cardinality estimate for each case.

When it comes to a hybrid histogram Oracle considers a value to be popular when its recorded endpoint_repeat_count value exceeds the *original* bucket size. The endpoint_repeat_count is a new 12c histogram column property in which Oracle stores the number of times a distinct value has been repeated in a bucket. Let’s put this in a SQL format:

```
SQL> select
        uth.endpoint_number
        ,uth.endpoint_actual_value
        ,uth.endpoint_repeat_count
        ,ucs.sample_size/ucs.num_buckets bucket_size
        ,(uth.endpoint_repeat_count - ucs.sample_size/ucs.num_buckets) Popularity
     from
        user_tab_histograms uth
        ,user_tab_col_statistics ucs
     where uth.table_name = ucs.table_name
     and uth.column_name  = ucs.column_name
     and uth.table_name   = 'T1'
     and uth.column_name  = 'N1'
     order by uth.endpoint_number;

ENDPOINT_NUMBER ENDPOINT_A ENDPOINT_REPEAT_COUNT BUCKET_SIZE POPULARITY
--------------- ---------- --------------------- ----------- ----------
              1 8                              1           5         -4
              6 13                             3           5         -2
             12 18                             2           5         -3
             20 20                             5           5          0
             26 23                             2           5         -3
             32 26                             3           5         -2
             38 27                             6           5          1
             44 28                             6           5          1
             50 29                             6           5          1
             58 31                             5           5          0
             69 33                             8           5          3
             79 35                             7           5          2
             86 38                             5           5          0
             90 41                             1           5         -4
             92 42                             2           5         -3
             95 43                             3           5         -2
             96 44                             1           5         -4
             97 45                             1           5         -4
             98 46                             1           5         -4
            100 59                             1           5         -4

20 rows selected.
```

The **Popularity** column above represents the difference between the frequency of appearance of each column value and the bucket size (sample_size/num_buckets). It can easily be inferred from this column that a value is considered popular when its **Popularity** value is greater than zero. Refactoring the above query a little gives this:

```
SQL> select
        endpoint_number
        ,endpoint_actual_value
        ,endpoint_repeat_count
        ,bucket_size
        ,case when Popularity > 0 then 'Pop'
              else 'Non-Pop'
         end Popularity
     from
       (select
           uth.endpoint_number
           ,uth.endpoint_actual_value
           ,uth.endpoint_repeat_count
           ,ucs.sample_size/ucs.num_buckets bucket_size
           ,(uth.endpoint_repeat_count - ucs.sample_size/ucs.num_buckets) Popularity
        from
           user_tab_histograms uth
           ,user_tab_col_statistics ucs
        where uth.table_name = ucs.table_name
        and uth.column_name  = ucs.column_name
        and uth.table_name   = 'T1'
        and uth.column_name  = 'N1'
       )
     order by endpoint_number;

ENDPOINT_NUMBER ENDPOINT_A ENDPOINT_REPEAT_COUNT BUCKET_SIZE POPULAR
--------------- ---------- --------------------- ----------- -------
              1 8                              1           5 Non-Pop
              6 13                             3           5 Non-Pop
             12 18                             2           5 Non-Pop
             20 20                             5           5 Non-Pop
             26 23                             2           5 Non-Pop
             32 26                             3           5 Non-Pop
             38 27                             6           5 Pop
             44 28                             6           5 Pop
             50 29                             6           5 Pop
             58 31                             5           5 Non-Pop
             69 33                             8           5 Pop
             79 35                             7           5 Pop
             86 38                             5           5 Non-Pop
             90 41                             1           5 Non-Pop
             92 42                             2           5 Non-Pop
             95 43                             3           5 Non-Pop
             96 44                             1           5 Non-Pop
             97 45                             1           5 Non-Pop
             98 46                             1           5 Non-Pop
            100 59                             1           5 Non-Pop

20 rows selected.
```

Simply put, a captured Hybrid histogram value is considered to be:

- popular, when its endpoint_repeat_count is greater than the **original** bucket size
- non-popular, with a recorded endpoint number in the histogram
- non-popular, and absent from the histogram information

Thus, in contrast to its predecessor, the Height Balanced histogram, it is not sufficient for a Hybrid value to be captured in the histogram with an endpoint_repeat_count greater than 1 for it to be considered popular.
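The decision rule above can be condensed into a tiny Python helper (a sketch, using the figures from the query output above):

```python
def classify(endpoint_repeat_count, sample_size=100, num_buckets=20):
    """A value captured in a Hybrid histogram is popular only when its
    repeat count strictly exceeds the original bucket size."""
    bucket_size = sample_size / num_buckets        # 5 for our data set
    return 'Pop' if endpoint_repeat_count > bucket_size else 'Non-Pop'

print(classify(8))   # value 33 -> Pop
print(classify(5))   # value 20 -> Non-Pop (equal is not enough)
print(classify(2))   # value 18 -> Non-Pop, despite a repeat count > 1
```

Note the strict inequality: value 20, which fills exactly one original bucket (repeat count 5), is still non-popular.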

## Cardinality estimation

As stated above, the popularity of a Hybrid histogram value dictates the arithmetic used by Oracle to calculate its cardinality estimate. The preceding section illustrated three types of Hybrid histogram values:

- popular value
- non-popular value with an endpoint number
- non-popular value not present in the histogram table

Let’s start by figuring out how Oracle estimates the cardinality of the first type:

### Popular value

Below is a list of popular values in the current data set:

```
SQL> select
endpoint_actual_value
,endpoint_repeat_count
,bucket_size
,Popularity
from
(
select
endpoint_number
,endpoint_actual_value
,endpoint_repeat_count
,bucket_size
,case when Popularity > 0 then 'Pop'
else 'Non-Pop'
end Popularity
from
(
select
uth.endpoint_number
,uth.endpoint_actual_value
,uth.endpoint_repeat_count
,ucs.sample_size/ucs.num_buckets bucket_size
,(uth.endpoint_repeat_count - ucs.sample_size/ucs.num_buckets) Popularity
from
user_tab_histograms uth
,user_tab_col_statistics ucs
where
uth.table_name = ucs.table_name
and uth.column_name = ucs.column_name
and uth.table_name = 'T1'
and uth.column_name = 'N1'
)
)
where Popularity = 'Pop';
ENDPOINT_A ENDPOINT_REPEAT_COUNT BUCKET_SIZE POPULAR
---------- --------------------- ----------- -------
27 6 5 Pop
28 6 5 Pop
29 6 5 Pop
33 8 5 Pop
35 7 5 Pop
5 rows selected.
```

Let’s get the Oracle optimizer cardinality estimation using one of the above values:

```
SQL> explain plan for select count(1) from t1 where n1 = 33;

SQL> select * from table(dbms_xplan.display);

-------------------------------------------
| Id  | Operation          | Name | Rows  |
-------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |
|   1 |  SORT AGGREGATE    |      |     1 |
|*  2 |   TABLE ACCESS FULL| T1   |     8 |
-------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("N1"=33)
```

In order to get the above cardinality estimation, Oracle uses the following formula:

```
E-Rows (12c) = ENDPOINT_REPEAT_COUNT * num_rows/sample_size
E-Rows (12c) = 8 * 100/100 = 8
```

### Non-Popular value with an endpoint number

For the data set considered in this article, the following values are non-popular values having an endpoint number:

```
SQL> select
        endpoint_actual_value
        ,endpoint_repeat_count
        ,bucket_size
        ,Popularity
     from
       (select
           endpoint_number
           ,endpoint_actual_value
           ,endpoint_repeat_count
           ,bucket_size
           ,case when Popularity > 0 then 'Pop'
                 else 'Non-Pop'
            end Popularity
        from
          (select
              uth.endpoint_number
              ,uth.endpoint_actual_value
              ,uth.endpoint_repeat_count
              ,ucs.sample_size/ucs.num_buckets bucket_size
              ,(uth.endpoint_repeat_count - ucs.sample_size/ucs.num_buckets) Popularity
           from
              user_tab_histograms uth
              ,user_tab_col_statistics ucs
           where uth.table_name = ucs.table_name
           and uth.column_name  = ucs.column_name
           and uth.table_name   = 'T1'
           and uth.column_name  = 'N1'
          )
       )
     where Popularity = 'Non-Pop';

ENDPOINT_A ENDPOINT_REPEAT_COUNT BUCKET_SIZE POPULAR
---------- --------------------- ----------- -------
8                              1           5 Non-Pop
13                             3           5 Non-Pop
18                             2           5 Non-Pop
20                             5           5 Non-Pop
23                             2           5 Non-Pop
26                             3           5 Non-Pop
31                             5           5 Non-Pop
38                             5           5 Non-Pop
41                             1           5 Non-Pop
42                             2           5 Non-Pop
43                             3           5 Non-Pop
44                             1           5 Non-Pop
45                             1           5 Non-Pop
46                             1           5 Non-Pop
59                             1           5 Non-Pop

15 rows selected.
```

Let’s then get the Oracle optimizer cardinality estimation of one of the above non-popular values:

```
SQL> explain plan for select count(1) from t1 where n1 = 45;

SQL> select * from table(dbms_xplan.display);

--------------------------------------------
| Id  | Operation          | Name | Rows  |
--------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |
|   1 |  SORT AGGREGATE    |      |     1 |
|*  2 |   TABLE ACCESS FULL| T1   |     2 |
--------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("N1"=45)
```

In contrast to what you might expect, the optimizer calculates the cardinality estimate for a non-popular value having an endpoint number using the following different formula:

```
E-Rows (12c) = num_rows * greatest(NewDensity, ENDPOINT_REPEAT_COUNT/sample_size)
```

The optimizer calculates the *NewDensity* using an internal algorithm based on factors like the number of distinct values, the number of popular buckets, the number of popular values, and so on. Nevertheless, using the following select (inspired by Alberto Dell'Era) we can get a reliable value for the NewDensity:

```
SQL> SELECT
        BktCnt
        ,PopBktCnt
        ,PopValCnt
        ,NDV
        ,pop_bucketSize
        ,trunc(((BktCnt-PopBktCnt)/BktCnt)/(NDV-PopValCnt),10) NewDensity
     FROM
       (SELECT
           COUNT(1) PopValCnt,
           SUM(endpoint_repeat_count) PopBktCnt,
           ndv, BktCnt, pop_bucketSize
        FROM
          (SELECT
              (sample_size - num_nulls) BktCnt,
              num_distinct ndv,
              num_buckets,
              density OldDensity,
              (sample_size-num_nulls)/num_buckets pop_bucketSize
           FROM user_tab_col_statistics
           WHERE table_name = 'T1'
           AND column_name  = 'N1'
          ), user_histograms
        WHERE table_name = 'T1'
        AND column_name  = 'N1'
        AND endpoint_repeat_count > pop_bucketSize
        GROUP BY ndv, BktCnt, pop_bucketSize
       );

    BKTCNT  POPBKTCNT  POPVALCNT        NDV POP_BUCKETSIZE NEWDENSITY
---------- ---------- ---------- ---------- -------------- ----------
       100         33          5         37              5   .0209375
```

Using this computed *NewDensity* in the above cardinality estimation formula gives this:

```
E-Rows (12c) = num_rows * greatest(NewDensity, ENDPOINT_REPEAT_COUNT/sample_size)
E-Rows (12c) = 100 * greatest(.0209375, 1/100) = 2.09375 ~ 2
```
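The same arithmetic can be replayed in Python (a sketch using the figures from the query output above; the 10-decimal truncation applied by the SQL is immaterial here):

```python
# Figures selected above for column N1 of table T1.
bkt_cnt, pop_bkt_cnt = 100, 33     # sampled rows, rows in popular buckets
ndv, pop_val_cnt = 37, 5           # distinct values, popular values
num_rows, sample_size = 100, 100

# NewDensity = (fraction of rows in non-popular buckets) / (non-popular NDV)
new_density = ((bkt_cnt - pop_bkt_cnt) / bkt_cnt) / (ndv - pop_val_cnt)
print(new_density)                 # 0.0209375

# Non-popular value 45 has endpoint_repeat_count = 1.
e_rows = num_rows * max(new_density, 1 / sample_size)
print(e_rows)                      # 2.09375, rounded by the optimizer to 2
```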

The corresponding 10053 CBO trace backs up the above arithmetic, as shown in the following:

```
=====================================
Access path analysis for T1
***************************************
SINGLE TABLE ACCESS PATH
  Single Table Cardinality Estimation for T1[T1]
  SPD: Return code in qosdDSDirSetup: NOCTX, estType = TABLE
  Column (#2):
    NewDensity: 0.020938, OldDensity: 0.026167 BktCnt: 100, PopBktCnt: 33,
    PopValCnt: 5, NDV: 37
  Column (#2): N1(NUMBER)
    AvgLen: 3 NDV: 37 Nulls: 0 Density: 0.020938 Min: 8.000000 Max: 59.000000
    Histogram: Hybrid  #Bkts: 20  UncompBkts: 100  EndPtVals: 20  ActualVal: yes
  Table: T1  Alias: T1
    Card: Original: 100.  Rounded: 2  Computed: 2.093750  Non Adjusted: 2.093750
  Best:: AccessPath: TableScan
         Cost: 3.001739  Degree: 1  Resp: 3.001739  Card: 2.093750  Bytes: 0.000000
```

Note how this trace file shows the same information (`NewDensity`, `BktCnt`, `PopBktCnt`, `PopValCnt` and `NDV`) we selected above when calculating the NewDensity. The same trace file also shows that the computed cardinality is exactly equal to the one we calculated using the supplied formula.

To give more credit to this formula, let's apply it to get the cardinality estimate of another non-popular value, one having an endpoint_repeat_count of 3 (that is, greater than 1), as shown below:

```
SQL> explain plan for select count(1) from t1 where n1 = 43;

SQL> select * from table(dbms_xplan.display);

-------------------------------------------
| Id  | Operation          | Name | Rows  |
-------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |
|   1 |  SORT AGGREGATE    |      |     1 |
|*  2 |   TABLE ACCESS FULL| T1   |     3 |
-------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("N1"=43)
```

```
E-Rows (12c) = num_rows * greatest(NewDensity, ENDPOINT_REPEAT_COUNT/sample_size)
E-Rows (12c) = 100 * greatest(.0209375, 3/100) = 3
```

### Non-Popular value without an endpoint number

In order to calculate the cardinality estimates of a hybrid column value that has not been captured into the histogram, the Optimizer uses the following formula:

`E-Rows (12c) = num_rows * NewDensity = 100 * .0209375 = 2.09375 ~ 2`

The code below illustrates this formula applied to one of the non-captured values:

```
SQL> explain plan for select count(1) from t1 where n1 = 17;

SQL> select * from table(dbms_xplan.display);

-------------------------------------------
| Id  | Operation          | Name | Rows  |
-------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |
|   1 |  SORT AGGREGATE    |      |     1 |
|*  2 |   TABLE ACCESS FULL| T1   |     2 |
-------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("N1"=17)
```

As the above formula suggests, **all** non-popular hybrid histogram values that have not been captured will **always** have the **same constant** cardinality estimation whatever their real count is in the table.
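Putting the three cases together, the cardinality arithmetic can be sketched as a single hypothetical Python function (the captured endpoints and the NewDensity below are the figures from this article's data set, and the function is an illustration, not the optimizer's actual code):

```python
def e_rows(value, captured, num_rows=100, sample_size=100,
           num_buckets=20, new_density=0.0209375):
    """Sketch of the 12c Hybrid histogram cardinality arithmetic for an
    equality predicate. `captured` maps endpoint values to their
    endpoint_repeat_count."""
    bucket_size = sample_size / num_buckets
    if value in captured:
        rpt = captured[value]
        if rpt > bucket_size:                       # popular value
            return round(rpt * num_rows / sample_size)
        # non-popular value, but with an endpoint number
        return round(num_rows * max(new_density, rpt / sample_size))
    return round(num_rows * new_density)            # not captured at all

captured = {33: 8, 43: 3, 45: 1}                    # subset of the histogram
print(e_rows(33, captured))   # 8  (popular)
print(e_rows(45, captured))   # 2  (non-popular, NewDensity wins)
print(e_rows(43, captured))   # 3  (non-popular, repeat count wins)
print(e_rows(17, captured))   # 2  (not captured: always num_rows * NewDensity)
```

The four results match the Rows column of the four execution plans shown in this section.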

## Instability

We stated above that, when applying rule 3 to ensure the minimum bucket size of 5 while building its Hybrid histogram, Oracle introduces a certain instability. With the actual data set of 100 rows and 37 distinct values, Oracle ended up making the value 17 non-popular without an endpoint, and 18 an endpoint value with an endpoint_repeat_count of 2, as shown below:

```
ENDPOINT_NUMBER ENDPOINT_A ENDPOINT_REPEAT_COUNT BUCKET_SIZE POPULAR
--------------- ---------- --------------------- ----------- -------
              1 8                              1           5 Non-Pop
              6 13                             3           5 Non-Pop
             12 18                             2           5 Non-Pop
```

If we add a new instance of 16 the reverse situation will happen:

```
SQL> insert into t1 values (2, 16);

SQL> BEGIN
       dbms_stats.gather_table_stats
         (user, 't1', method_opt => 'for all columns size 20');
     END;
     /

SQL> select
        uth.endpoint_number
        ,uth.endpoint_actual_value
        ,uth.endpoint_repeat_count
     from
        user_tab_histograms uth
        ,user_tables ut
        ,user_tab_col_statistics ucs
     where uth.table_name = 'T1'
     and uth.column_name  = 'N1'
     and uth.table_name   = ut.table_name
     and ut.table_name    = ucs.table_name
     and uth.column_name  = ucs.column_name
     order by uth.endpoint_number;

ENDPOINT_NUMBER ENDPOINT_A ENDPOINT_REPEAT_COUNT
--------------- ---------- ---------------------
              1 8                              1
              6 13                             3
             11 17                             1 ***
             16 19                             3
             21 20                             5
...
             99 46                             1
            101 59                             1

20 rows selected.
```

See how “18” has been excluded from the histogram information while “17” has been incorporated into it. This new situation is the result of the following new data set organization:

`8 12 12 13 13|13 15 16 16 16|17 18 18 19 19 19`

Which, when passed through the Hybrid histogram crunching strategy gives this:

`8 12 12 13 13 13| 15 16 16 16 17 |18 18 19 19 19`

Add an extra instance of 16 and both 17 and 18 will quit the histogram information, while 16 will be recorded with a correct frequency:

```
SQL> insert into t1 values (3, 16);

SQL> BEGIN
       dbms_stats.gather_table_stats
         (user, 't1', method_opt => 'for all columns size 20');
     END;
     /

ENDPOINT_NUMBER ENDPOINT_A ENDPOINT_REPEAT_COUNT
--------------- ---------- ---------------------
              1 8                              1
              6 13                             3
             11 16                             4 ***
             17 19                             3
             22 20                             5
...
            100 46                             1
            102 59                             1

20 rows selected.
```

## Conclusion

Frequency histograms, provided you've collected them at the right moment using a representative sample, can help the optimizer get correct cardinality estimates. However, their internal upper bound of 254 distinct values drastically limits their usefulness in favor of the imprecise Height Balanced histogram, which requires, for a distinct value to be properly captured, that one of its instances appears as the endpoint of at least two distinct buckets. Oracle 12cR1 has not only extended the internal limit of the frequency histogram to 2048 distinct values, but has also implemented two new types of histogram: TOP-Frequency and Hybrid. The latter combines the best attributes of its two predecessors: the accuracy of the frequency histogram, by counting the number of times a distinct value repeats itself; and the bucket-saving quality of the Height Balanced histogram, by collapsing all instances of the same distinct value into a single bucket without increasing the original number of buckets. The Hybrid histogram captures more popular values, and therefore gives the Oracle optimizer a greater chance of coming up with an optimal execution plan.

Code for the examples: InsT1

## 1 Comment

Nenad Noveljic

22/11/2016

Hello Mohamed,

Thank you for sharing this information. Your blog post helped me a lot at troubleshooting wrong cardinality estimations with hybrid histograms.

However, I noticed that the queries don't return accurate results when applied to data distributions with strong peaks. In my opinion, the new density formula has changed with the introduction of hybrid histograms. What I mean by that is the expression (BktCnt-PopBktCnt)/BktCnt can't be used any more for calculating the fraction of unpopular values, because the buckets are of uneven sizes. Therefore, I described the alternative method for calculating the density [here](http://nenadnoveljic.com/blog/density-calculation-hybrid-histograms/).

Kind regards,

Nenad
