In the previous article about the Oracle 12c TOP-Frequency histogram, we saw that Oracle carries out a preliminary determination step to see how dominant the TOP-N values are with regard to the total number of rows in the data set. I demonstrated that a TOP-Frequency histogram is possible provided a threshold condition is satisfied. This second article explores the Hybrid histogram type, which occurs when the data set fails to qualify for a TOP-Frequency histogram. Firstly, it shows that the method used by Oracle to compute a Hybrid histogram is strongly related to the TOP-Frequency preliminary determination step. It then shows that a column value can be popular, non-popular with an endpoint value, or non-popular without an endpoint value. Finally, it examines the formulas used by the optimizer to compute its cardinality estimation for each of these three cases.

Natural or Adjusted?

In order to investigate the method used by Oracle to calculate a Hybrid histogram, I am going to use a data set taken from Jonathan Lewis' blog, as shown below:

SQL> create table t1 (id number, n1 number);

SQL> start InsT1.sql (see downloadable script at the end)

In the following, I am going to gather a Hybrid histogram in 12c (12.1.0.2.0) by specifying fewer buckets (20) than distinct values (37) for the current data set and using default sampling:

SQL> exec dbms_stats.set_global_prefs ('TRACE', to_char (1+16)); 
SQL> BEGIN
        dbms_stats.gather_table_stats
          (user, 't1', method_opt => 'for columns n1 size 20');
     END;
    /
SQL> exec dbms_stats.set_global_prefs('TRACE', null);

As in the previous article, the above call to the dbms_stats package has been traced so that we can explore its content in the following snippet:

DBMS_STATS:  NNV  NDV  AVG  MMX  HST  EP   RP   NNNP IND  CNDV HSTN HSTR  COLNAME
DBMS_STATS:   Y    Y    Y    Y    Y    Y    Y    Y                   Y    N1
DBMS_STATS: Approximate NDV Options
DBMS_STATS: ACL,TOPN,NIL,NIL,RWID,U,U20U
DBMS_STATS: start processing top n values for column "N1"
DBMS_STATS:   >> frequency histograms is not feasible
                       (topn_values is null), skip!
DBMS_STATS: Iteration: 1 numhist: 1
DBMS_STATS:  NNV  NDV  AVG  MMX  HST  EP   RP   NNNP IND  CNDV HSTN HSTR  COLNAME
DBMS_STATS:                       Y    Y    Y    Y                   Y    N1
DBMS_STATS: Building Histogram for N1
DBMS_STATS:  bktnum=20, nnv=100, snnv=100, sndv=37, est_ndv=37, mnb=20
DBMS_STATS:  Trying hybrid histogram

As expected, Oracle started by carrying out the usual TOP-Frequency preliminary determination step. It is worth emphasizing that this data set fails to immediately qualify for a TOP-Frequency histogram because the TOP-20 values account for 80 rows while the threshold is 95 rows, as shown below:

SQL> select
         sum (cnt) TopNRows
    from (select
            n1
           ,count(*) cnt
         from t1
         group by n1
          order by count(*) desc
          )
   where rownum <= 20;

  TOPNROWS
----------
        80

SQL> select round ((20-1)/20 * 100) threshold from dual;

 THRESHOLD
----------
        95
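
The qualification arithmetic above can be sketched in a few lines of Python (an illustration of the check described in the article, not Oracle's internal code):

```python
# Sketch of the TOP-Frequency qualification check (illustrative only;
# not Oracle's internal code).
def qualifies_for_top_frequency(top_n_rows, num_rows, num_buckets):
    # A TOP-Frequency histogram requires the TOP-N values to cover at
    # least round((b - 1) / b * num_rows) rows of the data set.
    threshold = round((num_buckets - 1) / num_buckets * num_rows)
    return top_n_rows >= threshold, threshold

# The current data set: the TOP-20 values cover only 80 of 100 rows.
ok, threshold = qualifies_for_top_frequency(80, 100, 20)
print(threshold, ok)   # 95 False -> fall back to a Hybrid histogram
```

With 80 rows against a threshold of 95, the data set fails the check and Oracle moves on to a Hybrid histogram.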

This is what I tend to qualify as a “Natural Hybrid” histogram, representing a data set which doesn’t qualify for a TOP-Frequency because the threshold is greater than the naturally non-adjusted TopNRows, as explained in the previous article. However, what happens for a data set that initially satisfies the TOP-Frequency threshold but fails to qualify for a TOP-Frequency histogram because Oracle finds, in the middle of the process, that the threshold is greater than the Adjusted TopNRows? A dbms_stats trace file of such a data set shows a different content:

DBMS_STATS: start processing top n values for column "N1"
DBMS_STATS:   >> frequency histograms is not feasible
                       (topn_values is null), skip!
DBMS_STATS: start processing top n values for column "N2"
DBMS_STATS: topn sql (len: 589):
DBMS_STATS: +++ select /*+  no_parallel(t) no_parallel_index(t) 
                           dbms_stats cursor_sharing_exact use_weak_name_resl
                           dynamic_sampling(0) no_monitoring 
				 xmlindex_sel_idx_tbl no_substrb_pad  
                        */ 
	           substrb(dump("N2",16,0,64),1,240) val
                 ,rowidtochar(rowid) rwid 
	          from "XXXX"."T1" t 
		    where rowid in  
              (chartorowid('AAAJmWAAEAACN1zAAA'),chartorowid('AAAJmWAAEAACN1zAAB')
	        ,chartorowid('AAAJmWAAEAACN1zAAC'),chartorowid('AAAJmWAAEAACN1zAAF')
	        ,chartorowid('AAAJmWAAEAACN1zAAG'),chartorowid('AAAJmWAAEAACN1zAAH')
	        ,chartorowid('AAAJmWAAEAACN1zAAJ'),chartorowid('AAAJmWAAEAACN1zAAU')) 
	        order by "N2"	   
DBMS_STATS: remove last bucket: Typ=2 Len=2: c1,2b add: Typ=2 Len=3: c2,2,1a
DBMS_STATS: removal_count: 1 total_nonnull_rows: 200 mnb:  8
DBMS_STATS: Abort top-n histogram, as the addition of min/max does not preserve the minimum coverage: .85225054 vs. .91
DBMS_STATS:  Building Histogram for N2
DBMS_STATS:  Trying hybrid histogram

While in the first case Oracle immediately abandoned the Top-Frequency path, in the second case it went a little bit deeper before aborting the top-frequency process in favor of a Hybrid histogram. This is what I tend to qualify as the “Adjusted Hybrid” histogram representing a data set that fails to qualify for a TOP-Frequency because Oracle finds that the threshold is greater than the Adjusted TopNRows in the middle of the process.

The method used by Oracle to compute a Hybrid histogram depends on whether the data set is of the “Natural Hybrid” type or the “Adjusted Hybrid” one. For the latter type, Oracle takes advantage of the pre-processing it has already accomplished and transforms the aborted top-frequency into a hybrid histogram. For the former type, Oracle uses a bucket-frequency strategy.

The data set used in this article qualifies for a “Natural Hybrid” histogram. This is why the next section explains the method used by Oracle to collect this kind of hybrid histogram. The explanation of the “Adjusted Hybrid” method will be left to a separate article.

The method

Oracle has computed a Hybrid histogram for the current data set using a bucket size of 5 (sample_size/num_buckets = 100/20), as shown below:

SQL> select
       column_name
      ,num_distinct
      ,num_buckets
      ,sample_size
      ,histogram
    from
       user_tab_col_statistics
    where table_name = 'T1'
    and column_name  ='N1';

COLUMN_NAME  NUM_DISTINCT NUM_BUCKETS SAMPLE_SIZE HISTOGRAM
------------ ------------ ----------- ----------- -----------
N1                  37          20         100     HYBRID

At this stage of the article, let’s label this bucket size as the “original” bucket size. The method used by Oracle to compute this kind of histogram seems to obey the following strategy:

  1. a distinct value should not be found in more than one bucket
  2. a bucket size is allowed to be extended in order to contain all instances of the same distinct value

Since the bucket size can be extended so that no distinct value spans multiple buckets, deciphering the new size of the extended buckets represents a crucial step in figuring out the method used by Oracle to compute a Hybrid histogram. Below is the histogram information stored in the data dictionary:

SQL> SELECT
        (row_number() over(order by ept_nbr)-1) NumBucket
        ,ept_nbr
        ,ept_act_val
        ,rpt_cnt
        ,ept_nbr - (lag(ept_nbr,1,0) over(order by ept_nbr)) "new bucket size"
        ,bucket_size "original bucket_size"
    FROM
        (SELECT
             ah.endpoint_number            ept_nbr
            ,ah.endpoint_actual_value      ept_act_val
            ,lag(ah.endpoint_number,1,0) over(order by ah.endpoint_number) ept_lag
            ,ah.endpoint_repeat_count rpt_cnt
            ,at.sample_size/at.num_buckets bucket_size
         FROM
            user_tab_histograms      ah
           ,user_tab_col_statistics  at
         WHERE ah.table_name  = at.table_name
         AND ah.column_name = at.column_name
         AND ah.table_name  = 'T1'
         AND ah.column_name = 'N1'
       ) ORDER BY ept_nbr;

 NUMBUCKET    EPT_NBR EPT_ACT_VA    RPT_CNT new bucket size original bucket_size
---------- ---------- ---------- ---------- --------------- --------------------
         0          1 8                   1               1                    5
         1          6 13                  3               5                    5
         2         12 18                  2               6                    5
         3         20 20                  5               8                    5
         4         26 23                  2               6                    5
         5         32 26                  3               6                    5
         6         38 27                  6               6                    5
         7         44 28                  6               6                    5
         8         50 29                  6               6                    5
         9         58 31                  5               8                    5
        10         69 33                  8              11                    5
        11         79 35                  7              10                    5
***     12         86 38                  5               7                    5
        13         90 41                  1               4                    5
        14         92 42                  2               2                    5
        15         95 43                  3               3                    5
        16         96 44                  1               1                    5
        17         97 45                  1               1                    5
        18         98 46                  1               1                    5
        19        100 59                  1               2                    5 
20 rows selected.

Excluding the boundaries, and up to the 13th endpoint number 86, all new bucket sizes are greater than or equal to the original bucket size. In order to get a clue about what’s happening behind the scenes, let’s draw an ordered picture of the data set divided into 20 buckets, each of size 5, as shown below:

8 12 12 13 13 |13 15 16 16 17 |18 18 19 19 19|20 20 20 20 20|21 22 22 22 23|23 24 24 25 26|

26 26 27 27 27|27 27 27 28 28 |28 28 28 28 29|29 29 29 29 29|30 30 30 31 31|31 31 31 32 32|

32 33 33 33 33|33 33 33 33 34 |34 34 35 35 35|35 35 35 35 36|37 38 38 38 38|38 39 39 40 41|

42 42 43 43 43|44 45 46 50 59|

Oracle starts walking the above ordered data set until it finds that value 13 traverses two buckets. Based on the first rule above, Oracle extends bucket n°1 by moving all instances of 13 from bucket n°2 into it, giving us this intermediate result:

8 12 12 13 13 13| 15 16 16 17 |18 18 19 19 19|

Note that 13 is not only the endpoint value of bucket n°1; since it is now located in a single bucket, Oracle is also able to record its frequency of 3 (endpoint_repeat_count), as shown in the histogram information reproduced below:

NUMBUCKET    EPT_NBR EPT_ACT_VA    RPT_CNT new bucket size original bucket_size
---------- ---------- ---------- ---------- --------------- --------------------
         1          6 13                  3               5                    5
         2         12 18                  2               6                    5

However, the above histogram information shows that bucket n°2 is of size 6 and has 18 as its endpoint value, with a frequency of 2. This is not in accordance with this picture:

8 12 12 13 13 13| 15 16 16 17 |18 18 19 19 19|

Looking at this new situation, we can spot that when jumping from bucket n°2 to bucket n°3 no distinct value crosses the bucket boundary, so we could keep bucket n°2 with its new size of 4. This, however, is not the case. In fact, it appears that there is a third rule used by Oracle when generating a “Natural Hybrid” histogram:

  3. do not create buckets with a size less than the original size (5) (not applicable at the end of the data set)

Since the above rule applies, bucket n°2, being of size 4, has to be extended to reach a size of 5. This is why the first instance of 18 from bucket n°3 has to be moved into bucket n°2, as in the following:

8 12 12 13 13 13| 15 16 16 17 18| 18 19 19 19|

However, since there are two instances of 18 in bucket n°3 they have to move together to obey the first rule in the strategy: no distinct values spanning multiple buckets. This gives us the following situation:

8 12 12 13 13 13| 15 16 16 17 18 18| 19 19 19

As a result of this, we see that we do indeed have a bucket n°2 of size 6, with 18 as its endpoint value and a frequency of 2. We will see later in this article that it is this rule 3) that may introduce instability in the Hybrid histogram.

Oracle will continue traversing the rest of the data set, applying the three-point strategy above, until it finds that value 23 spans multiple buckets:

8 12 12 13 13 13| 15 16 16 17 18 18| 19 19 19 20 20 20 20 20|-|21 22 22 22 23|23 24 24 25 26|

Applying the same rule again, all instances of value 23 will be moved from bucket n°6 to bucket n°5, resulting in the following situation:

8 12 12 13 13 13| 15 16 16 17 18 18| 19 19 19 20 20 20 20 20|-|21 22 22 22 23 23| 24 24 25 26|

And so on until it reaches the end of the data set, producing the following final histogram:

8 12 12 13 13 13|15 16 16 17 18 18|19 19 19 20 20 20 20 20|-|21 22 22 22 23 23|24 24 25 26 26 26|27 27 27 27 27 27|28 28 28 28 28 28 |29 29 29 29 29 29|-|30 30 30 31 31 31 31 31|32 32 32 33 33 33 33 33 33 33 33|-|34 34 34 35 35 35 35 35 35 35|-|36 37 38 38 38 38 38|-|39 39 40 41|42 42 |43 43 43|44|45|46|50 59|

You will notice that, starting from the thirteenth bucket at value 39, the bucket extension rule and minimum size requirement seem to be ignored. At this stage Oracle is left with 14 distinct values to spread over 8 buckets. This observation paves the way to what seems to be the fourth rule applied by Oracle when gathering a Natural Hybrid histogram:

  4. the original number of buckets (20) should not be reduced

Instead of reducing the original number of buckets (20), Oracle seems to divide the number of remaining distinct values (14) by the number of remaining buckets (8) and spread them accordingly. Unfortunately, beyond the observation that the initial number of buckets is kept unchanged, I didn’t succeed in working out the exact strategy used by Oracle in this last step to spread the data across the remaining buckets.
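
Taken together, the rules observed so far can be approximated by a greedy walk over the ordered sample. The sketch below is only my interpretation of the observed behaviour, not Oracle's actual algorithm: it records the minimum as the first endpoint, fills each bucket to the original size, and extends it while the bucket's last value continues into the following rows. For this data set it reproduces the dictionary content up to endpoint number 86, after which the tail behaviour (the fourth rule) diverges:

```python
# Greedy sketch of rules 1-3 (my interpretation of the observed
# behaviour, not Oracle's actual code).
def hybrid_endpoints(data, bucket_size):
    data = sorted(data)
    # the minimum is always recorded as the first endpoint
    endpoints = [(1, data[0], 1)]
    i = 1
    while i < len(data):
        end = min(i + bucket_size, len(data))      # fill to the original size
        while end < len(data) and data[end] == data[end - 1]:
            end += 1                               # rule 1/2: no value spans two buckets
        ept_val = data[end - 1]
        rpt_cnt = data[i:end].count(ept_val)       # endpoint_repeat_count
        endpoints.append((end, ept_val, rpt_cnt))  # (ept_nbr, value, rpt_cnt)
        i = end
    return endpoints

# (value, frequency) pairs of the 100-row data set used in this article
freqs = [(8, 1), (12, 2), (13, 3), (15, 1), (16, 2), (17, 1), (18, 2),
         (19, 3), (20, 5), (21, 1), (22, 3), (23, 2), (24, 2), (25, 1),
         (26, 3), (27, 6), (28, 6), (29, 6), (30, 3), (31, 5), (32, 3),
         (33, 8), (34, 3), (35, 7), (36, 1), (37, 1), (38, 5), (39, 2),
         (40, 1), (41, 1), (42, 2), (43, 3), (44, 1), (45, 1), (46, 1),
         (50, 1), (59, 1)]
data = [v for v, n in freqs for _ in range(n)]

for ept in hybrid_endpoints(data, 5)[:13]:
    print(ept)
```

The first thirteen printed tuples match the (ept_nbr, ept_act_val, rpt_cnt) columns of the dictionary query up to endpoint number 86; beyond that point this simple greedy walk no longer matches the histogram Oracle produces.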

Popular or non-popular?

Oracle developers and tuning specialists are interested in having optimal execution plans. When the optimizer is supplied with accurate statistics, it will mostly come up with an optimal execution plan. However, when the optimizer uses a histogram to get its cardinality estimate, the arithmetic it employs for a popular value differs from that used for a non-popular value. This is why identifying the popularity of a column value represents a crucial step in the tuning process. This section shows when a column value with a hybrid histogram is considered popular or not, and shows the arithmetic used by Oracle to compute the cardinality estimate in each case.

When it comes to a hybrid histogram Oracle considers a value to be popular when its recorded endpoint_repeat_count value exceeds the original bucket size. The endpoint_repeat_count is a new 12c histogram column property in which Oracle stores the number of times a distinct value has been repeated in a bucket. Let’s put this in a SQL format:

SQL> select
         uth.endpoint_number
        ,uth.endpoint_actual_value
        ,uth.endpoint_repeat_count
        ,ucs.sample_size/ucs.num_buckets bucket_size      
        ,(uth.endpoint_repeat_count - ucs.sample_size/ucs.num_buckets) Popularity
    from
        user_tab_histograms uth
       ,user_tab_col_statistics ucs
   where
        uth.table_name   = ucs.table_name
    and uth.column_name   = ucs.column_name
    and uth.table_name    = 'T1'
    and uth.column_name   = 'N1'
   order by uth.endpoint_number;

ENDPOINT_NUMBER ENDPOINT_A ENDPOINT_REPEAT_COUNT BUCKET_SIZE POPULARITY
--------------- ---------- --------------------- ----------- ----------
              1 8                              1           5         -4
              6 13                             3           5         -2
             12 18                             2           5         -3
             20 20                             5           5          0
             26 23                             2           5         -3
             32 26                             3           5         -2
             38 27                             6           5          1
             44 28                             6           5          1
             50 29                             6           5          1
             58 31                             5           5          0
             69 33                             8           5          3
             79 35                             7           5          2
             86 38                             5           5          0
             90 41                             1           5         -4
             92 42                             2           5         -3
             95 43                             3           5         -2
             96 44                             1           5         -4
             97 45                             1           5         -4
             98 46                             1           5         -4
            100 59                             1           5         -4
20 rows selected.

The Popularity column above represents the difference between the frequency of appearance of each column value and the bucket size (sample_size/num_buckets). It can easily be inferred from this column that a value is considered popular when its Popularity value is greater than zero. Refactoring the above query a little gives this:

SQL> select
         endpoint_number
        ,endpoint_actual_value
        ,endpoint_repeat_count
        ,bucket_size
        ,case when Popularity > 0 then 'Pop'
                   else 'Non-Pop'
          end Popularity
    from
   (
     select
         uth.endpoint_number
        ,uth.endpoint_actual_value
        ,uth.endpoint_repeat_count
        ,ucs.sample_size/ucs.num_buckets bucket_size      
        ,(uth.endpoint_repeat_count - ucs.sample_size/ucs.num_buckets) Popularity
    from
        user_tab_histograms uth
       ,user_tab_col_statistics ucs
   where
        uth.table_name   = ucs.table_name
    and uth.column_name   = ucs.column_name
    and uth.table_name    = 'T1'
    and uth.column_name   = 'N1'
    )
   order by endpoint_number;

ENDPOINT_NUMBER ENDPOINT_A ENDPOINT_REPEAT_COUNT BUCKET_SIZE POPULAR
--------------- ---------- --------------------- ----------- -------
              1 8                              1           5 Non-Pop
              6 13                             3           5 Non-Pop
             12 18                             2           5 Non-Pop
             20 20                             5           5 Non-Pop
             26 23                             2           5 Non-Pop
             32 26                             3           5 Non-Pop
             38 27                             6           5 Pop
             44 28                             6           5 Pop
             50 29                             6           5 Pop
             58 31                             5           5 Non-Pop
             69 33                             8           5 Pop
             79 35                             7           5 Pop
             86 38                             5           5 Non-Pop
             90 41                             1           5 Non-Pop
             92 42                             2           5 Non-Pop
             95 43                             3           5 Non-Pop
             96 44                             1           5 Non-Pop
             97 45                             1           5 Non-Pop
             98 46                             1           5 Non-Pop
            100 59                             1           5 Non-Pop
20 rows selected.

Simply put, a Hybrid histogram column value is considered to be one of three kinds:

  • popular when its endpoint_repeat_count is greater than the original bucket size
  • non-popular with a recorded endpoint number in the histogram
  • non-popular and absent from the histogram information

Thus, in contrast to its predecessor, the Height Balanced histogram, it is not sufficient for a Hybrid histogram value to be captured with an endpoint_repeat_count greater than 1 for it to be considered popular.

Cardinality estimation

As stated above, the popularity of a Hybrid histogram value dictates the arithmetic used by Oracle to calculate its cardinality estimate. The preceding section illustrates three types of Hybrid histogram values:

  • popular value
  • non-popular value with an endpoint number
  • non-popular value not present in the histogram table

Let’s start by figuring out how Oracle estimates the cardinality of the first type:

Popular value

Below is a list of popular values in the current data set:

SQL> select
         endpoint_actual_value
        ,endpoint_repeat_count
        ,bucket_size
        ,Popularity
from
(
  select
         endpoint_number
        ,endpoint_actual_value
        ,endpoint_repeat_count
        ,bucket_size
        ,case when Popularity > 0 then 'Pop'
                   else 'Non-Pop'
          end Popularity
    from
   (
     select
         uth.endpoint_number
        ,uth.endpoint_actual_value
        ,uth.endpoint_repeat_count
        ,ucs.sample_size/ucs.num_buckets bucket_size      
        ,(uth.endpoint_repeat_count - ucs.sample_size/ucs.num_buckets) Popularity
    from
        user_tab_histograms uth
       ,user_tab_col_statistics ucs
   where
        uth.table_name   = ucs.table_name
    and uth.column_name   = ucs.column_name
    and uth.table_name    = 'T1'
    and uth.column_name   = 'N1'
    )
  )
where Popularity = 'Pop';

ENDPOINT_A ENDPOINT_REPEAT_COUNT BUCKET_SIZE POPULAR
---------- --------------------- ----------- -------
27                             6           5 Pop
28                             6           5 Pop
29                             6           5 Pop
33                             8           5 Pop
35                             7           5 Pop

5 rows selected.

Let’s get the Oracle optimizer cardinality estimation using one of the above values:

SQL> explain plan for select count(1) from t1 where n1 = 33;

SQL> select * from table(dbms_xplan.display);
-------------------------------------------
| Id  | Operation          | Name | Rows  |
-------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |
|   1 |  SORT AGGREGATE    |      |     1 |
|*  2 |   TABLE ACCESS FULL| T1   |     8 |
-------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("N1"=33)

In order to get the above cardinality estimation, Oracle uses the following formula:

E-Rows (12c) = ENDPOINT_REPEAT_COUNT * num_rows/sample_size
E-Rows (12c) = 8 * 100/100 = 8
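
As a quick sanity check, this arithmetic can be expressed as a trivial function (a sketch of the formula shown above, not optimizer code):

```python
# Cardinality estimate for a popular Hybrid value, per the formula above.
def popular_card(endpoint_repeat_count, num_rows, sample_size):
    return endpoint_repeat_count * num_rows / sample_size

# n1 = 33 has endpoint_repeat_count = 8 in a 100-row, 100-row-sample table
print(popular_card(8, 100, 100))   # 8.0 -> Rows = 8 in the execution plan
```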

Non-Popular value with an endpoint number

For the data set considered in this article, the following values are non-popular values having an endpoint number:

SQL> select
         endpoint_actual_value
        ,endpoint_repeat_count
        ,bucket_size
        ,Popularity
from
(
  select
         endpoint_number
        ,endpoint_actual_value
        ,endpoint_repeat_count
        ,bucket_size
        ,case when Popularity > 0 then 'Pop'
                   else 'Non-Pop'
          end Popularity
    from
   (
     select
         uth.endpoint_number
        ,uth.endpoint_actual_value
        ,uth.endpoint_repeat_count
        ,ucs.sample_size/ucs.num_buckets bucket_size      
        ,(uth.endpoint_repeat_count - ucs.sample_size/ucs.num_buckets) Popularity
    from
        user_tab_histograms uth
       ,user_tab_col_statistics ucs
   where
        uth.table_name   = ucs.table_name
    and uth.column_name   = ucs.column_name
    and uth.table_name    = 'T1'
    and uth.column_name   = 'N1'
    )
  )
where Popularity = 'Non-Pop';

ENDPOINT_A ENDPOINT_REPEAT_COUNT BUCKET_SIZE POPULAR
---------- --------------------- ----------- -------
8                              1           5 Non-Pop
13                             3           5 Non-Pop
18                             2           5 Non-Pop
20                             5           5 Non-Pop
23                             2           5 Non-Pop
26                             3           5 Non-Pop
31                             5           5 Non-Pop
38                             5           5 Non-Pop
41                             1           5 Non-Pop
42                             2           5 Non-Pop
43                             3           5 Non-Pop
44                             1           5 Non-Pop
45                             1           5 Non-Pop
46                             1           5 Non-Pop
59                             1           5 Non-Pop
15 rows selected.

Let’s then get the Oracle optimizer cardinality estimation of one of the above non-popular values:

SQL> explain plan for select count(1) from t1 where n1 = 45;

SQL> select * from table(dbms_xplan.display);
--------------------------------------------
| Id  | Operation          | Name | Rows  |
--------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |
|   1 |  SORT AGGREGATE    |      |     1 |
|*  2 |   TABLE ACCESS FULL| T1   |     2 |
--------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("N1"=45)

In contrast to what you might guess, the optimizer calculates the cardinality estimate for a non-popular value having an endpoint number using the following different formula:

E-Rows (12c) = num_rows * greatest (NewDensity, ENDPOINT_REPEAT_COUNT/sample_size)

The optimizer calculates the NewDensity using an internal algorithm based on factors like the number of distinct values, the number of popular buckets, the number of popular values, etc. Nevertheless, using the following select (inspired by Alberto Dell'Era) we can get a reliable value for the NewDensity:

SELECT
         BktCnt
        ,PopBktCnt
        ,PopValCnt
        ,NDV
        ,pop_bucketSize
        ,trunc(((BktCnt-PopBktCnt)/BktCnt)/(NDV-PopValCnt),10) NewDensity
     FROM
        (SELECT
           COUNT(1) PopValCnt,
           SUM(endpoint_repeat_count) PopBktCnt,
           ndv,
           BktCnt,
           pop_bucketSize
         FROM
          (SELECT
            (sample_size - num_nulls) BktCnt,
            num_distinct ndv,
            num_buckets,
            density OldDensity,
            (sample_size-num_nulls)/num_buckets pop_bucketSize
          FROM user_tab_col_statistics
          WHERE
              table_name  = 'T1'
          AND column_name = 'N1'
          ),
          user_histograms
        WHERE table_name         = 'T1'
        AND column_name          = 'N1'
        AND endpoint_repeat_count> pop_bucketSize
        GROUP BY ndv,
          BktCnt,
          pop_bucketSize
        );

    BKTCNT  POPBKTCNT  POPVALCNT        NDV POP_BUCKETSIZE NEWDENSITY
---------- ---------- ---------- ---------- -------------- ----------
       100         33          5         37              5   .0209375

Using this computed NewDensity in the above cardinality estimation formula gives this:

E-Rows (12c) = num_rows * greatest (NewDensity, ENDPOINT_REPEAT_COUNT/ sample_size)
E-Rows (12c) = 100 * greatest (.0209375, 1/100) = 2.09375 ~ 2
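
Both pieces of arithmetic, the NewDensity derivation used in the select above and the greatest() formula, can be reproduced with a short calculation (a sketch based on the figures above, not optimizer code):

```python
# NewDensity as derived from the dictionary figures (same expression
# as the SQL above), followed by the non-popular estimate formula.
def new_density(bkt_cnt, pop_bkt_cnt, ndv, pop_val_cnt):
    return ((bkt_cnt - pop_bkt_cnt) / bkt_cnt) / (ndv - pop_val_cnt)

def non_popular_card(rpt_cnt, num_rows, sample_size, density):
    # E-Rows = num_rows * greatest(NewDensity, rpt_cnt / sample_size)
    return num_rows * max(density, rpt_cnt / sample_size)

d = new_density(100, 33, 37, 5)                  # 0.0209375
print(round(non_popular_card(1, 100, 100, d)))   # 2 -> n1 = 45, rpt_cnt = 1
```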

The corresponding 10053 CBO trace backs up the above arithmetic, as shown in the following:

=====================================
Access path analysis for T1
***************************************
SINGLE TABLE ACCESS PATH 
  Single Table Cardinality Estimation for T1[T1] 
  SPD: Return code in qosdDSDirSetup: NOCTX, estType = TABLE
Column (#2): 
NewDensity: 0.020938, OldDensity: 0.026167 BktCnt: 100, PopBktCnt: 33, PopValCnt: 5, NDV: 37

Column (#2): N1(NUMBER)
    AvgLen: 3 NDV: 37 Nulls: 0 Density: 0.020938 Min: 8.000000 Max: 59.000000
    Histogram: Hybrid  #Bkts: 20  UncompBkts: 100  EndPtVals: 20  ActualVal: yes
  Table: T1 Alias: T1
   Card: Original: 100.  Rounded: 2 Computed: 2.093750 Non Adjusted: 2.093750
  
  Best:: AccessPath: TableScan
         Cost: 3.001739 Degree: 1 Resp: 3.001739 Card: 2.093750 Bytes: 0.000000

Note how this trace file shows the same information (NewDensity, BktCnt, PopBktCnt, PopValCnt and NDV) we have selected above when calculating the NewDensity. The same trace file also shows that the computed cardinality is exactly equal to the one we’ve calculated using the supplied formula.

To further validate this formula, let's apply it to the cardinality estimation of another non-popular value, one having an endpoint_repeat_count of 3 (i.e. greater than 1), as shown below:

SQL> explain plan for select count(1) from t1 where n1 = 43;

SQL> select * from table(dbms_xplan.display);
-------------------------------------------
| Id  | Operation          | Name | Rows  |
-------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |
|   1 |  SORT AGGREGATE    |      |     1 |
|*  2 |   TABLE ACCESS FULL| T1   |     3 |
-------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("N1"=43)

E-Rows (12c) = num_rows * greatest (NewDensity, ENDPOINT_REPEAT_COUNT/ sample_size)
E-Rows (12c) = 100 * greatest (.0209375, 3/100) = 3

Non-Popular value without an endpoint number

In order to calculate the cardinality estimates of a hybrid column value that has not been captured into the histogram, the Optimizer uses the following formula:

E-Rows (12c) = num_rows * NewDensity = 100 * .0209375 = 2.09375 ~ 2

The code below illustrates this formula applied to one of the non-captured values:

SQL> explain plan for select count(1) from t1 where n1 = 17;

SQL> select * from table(dbms_xplan.display);
-------------------------------------------
| Id  | Operation          | Name | Rows  |
-------------------------------------------
|   0 | SELECT STATEMENT   |      |     1 |
|   1 |  SORT AGGREGATE    |      |     1 |
|*  2 |   TABLE ACCESS FULL| T1   |     2 |
-------------------------------------------

Predicate Information (identified by operation id):
---------------------------------------------------
   2 - filter("N1"=17)

As the above formula suggests, all non-popular hybrid histogram values that have not been captured will always have the same constant cardinality estimation, whatever their real count in the table.
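
Using the dictionary figures derived earlier, this formula can be checked in one line (an illustrative calculation, not optimizer code):

```python
# Estimate for a Hybrid value absent from the histogram:
# E-Rows = num_rows * NewDensity, regardless of the value's real count.
new_density = ((100 - 33) / 100) / (37 - 5)   # dictionary figures: 0.0209375
print(round(100 * new_density))               # 2 -> same estimate for n1 = 17
```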

Instability

We stated above that by applying rule 3) to ensure the minimum bucket size of 5 when building its Hybrid histogram, Oracle introduces a certain instability. With the actual data set of 100 rows and 37 distinct values, Oracle ended up making value 17 non-popular without an endpoint and value 18 an endpoint value with an endpoint_repeat_count of 2, as shown below:

ENDPOINT_NUMBER ENDPOINT_A ENDPOINT_REPEAT_COUNT BUCKET_SIZE POPULAR
--------------- ---------- --------------------- ----------- -------
              1 8                              1           5 Non-Pop
              6 13                             3           5 Non-Pop
             12 18                             2           5 Non-Pop

If we add a new instance of 16 the reverse situation will happen:

SQL> insert into t1 values (2, 16);

SQL> BEGIN
        dbms_stats.gather_table_stats
          (user, 't1', method_opt => 'for all columns size 20');
     END;
    /

SQL> select
         uth.endpoint_number
        ,uth.endpoint_actual_value
        ,uth.endpoint_repeat_count
    from
       user_tab_histograms uth
       ,user_tables ut
       ,user_tab_col_statistics ucs
   where
        uth.table_name    = 'T1'
       and uth.column_name   = 'N1'
       and uth.table_name    = ut.table_name
       and ut.table_name     = ucs.table_name
       and uth.column_name   = ucs.column_name
      order by uth.endpoint_number;

ENDPOINT_NUMBER ENDPOINT_A ENDPOINT_REPEAT_COUNT
--------------- ---------- ---------------------
              1 8                              1
              6 13                             3
             11 17                             1 ***
             16 19                             3
             21 20                             5
             ...
             99 46                             1
            101 59                             1
20 rows selected.

See how “18” has been excluded from the histogram information while “17” has been incorporated into it. This new situation is the result of the following new data set organization:

8 12 12 13 13|13 15 16 16 16|17 18 18 19 19 19

Which, when passed through the Hybrid histogram crunching strategy gives this:

8 12 12 13 13 13| 15 16 16 16 17 |18 18 19 19 19

Add an extra instance of '16' and both 17 and 18 will quit the histogram information, while '16' will be recorded with a correct frequency:

SQL> insert into t1 values (3, 16);

SQL> BEGIN
      dbms_stats.gather_table_stats
          (user, 't1', method_opt => 'for all columns size 20');
    END;
 /

ENDPOINT_NUMBER ENDPOINT_A ENDPOINT_REPEAT_COUNT
--------------- ---------- ---------------------
              1 8                              1
              6 13                             3
             11 16                             4 ***
             17 19                             3
             22 20                             5
             ...
            100 46                             1
            102 59                             1
20 rows selected.

Conclusion

Frequency histograms, provided you’ve collected them at the right moment using a representative sample, can help the optimizer get correct cardinality estimates. However, their internal upper bound of 254 distinct values drastically limits their usefulness in favor of the imprecise height balanced histogram which, for a distinct value to be properly captured as popular, needs one of its instances to appear as the endpoint of at least two distinct buckets. Oracle 12cR1 has not only extended the internal limit of the frequency histogram to 2048 distinct values but has also implemented two new types of histogram: Top-Frequency and Hybrid. The latter histogram type combines the best attributes of its two predecessors: the accuracy of the frequency histogram, by counting the number of times a distinct value repeats, and the bucket-saving economy of the height balanced histogram, by collapsing all instances of the same distinct value into a single bucket without increasing the original bucket number. The Hybrid histogram generates more popular values and therefore a greater chance for the Oracle optimizer to come up with an optimal execution plan.

Code for the examples: InsT1