16 October 2013

4181 views

0 0

Histograms Part 3 – When?

In part 2 we looked at the way that Oracle collects and uses height –balanced histograms, and raised the problem of how sensitive they could be to small changes in the sampling, or in the data. We ended the article with a suggestion on how you could generate data for an “approximate” frequency histogram that could be “good enough” to

In part 3, then, we’re going to look at a couple of cases where allowing Oracle to do its own thing could well be a sensible, and adequate, production strategy. As a general guideline, I’ve only found height-balanced histograms to be reasonably safe when you use them to “take out” a very small number of high frequency values – in effect allowing the optimizer to optimize for the remaining “interesting” data. There are, however, a couple of design errors where you can use histograms as a damage-limitation mechanism,

I had been planning to make some comments about how to incorporate histogram collection into a complete stats collection strategy – but this article became longer than I expected, so I’m going to postpone my thoughts on the actual stats collection implementation to another day.

Database-agnostic applications

There are applications in the wild that have been written to run against “any” database, and they try to minimize their dependencies on any features of any given database. As a consequence, “runs on any database” tends to mean “runs badly on any database”,

From a pure data perspective, two of the most common aspects of agnosticism are:

We do not use NULL – we use a special (extreme) value.
We do not use date datatypes

The side-effects on Oracle’s optimizer can be catastrophic.

We do not use NULL

Here’s a script to create a simple data set to make the point. The data represents 5 years of data (starting on 1^st Jan 2008) at 100 rows per day, with every 1000^th row set to the “pseudo- null” value of 31^st Dec 4000. After creating the data I gathered statistics without generating any histograms, i.e. using method_opt => ‘for all columns size 1’

create table t1

with generator as (

select --+ materialize

rownum id

from all_objects

where rownum <= 2000

)

select

/*+ ordered use_nl(v2) */

decode(

mod(rownum - 1,1000),

0,to_date('31-Dec-4000'),

to_date('01-Jan-2008') + trunc((rownum - 1)/100)

) date_closed

from

generator v1,

generator v2

where

rownum <= 1827 * 100

;

I’m not going to worry about indexes or details of optimum access paths, all I’m interested in at present is the optimizer’s cardinality calculation, and how close it gets to the right answer. Based on the SQL you can see that a query for “all the data for 2010” should return 36,500 rows – so let’s write a query that asks for that data, and check the resulting execution plan for its row prediction:

set autotrace traceonly explain

select

from t1

where date_closed between to_date('01-Jan-2010','dd-mon-yyyy')

and to_date('31-Dec-2010','dd-mon-yyyy')

;

----------------------------------------------------------

---------------------------------------------------------

| 0 | SELECT STATEMENT | | 291 | 2328 | 52 |

|* 1 | TABLE ACCESS FULL| T1 | 291 | 2328 | 52 |

----------------------------------------------------------

Predicate Information (identified by operation id):

---------------------------------------------------

1 - filter("DATE_CLOSED">=TO_DATE(' 2010-01-01 00:00:00',

'syyyy-mm-dd hh24:mi:ss') AND "DATE_CLOSED"<=TO_DATE(' 2010-12-31

00:00:00', 'syyyy-mm-dd hh24:mi:ss'))

At 291 rows predicted, the optimizer seems to be out by a factor of 125; in most circumstance an error this large is likely to produce a bad choice of execution plan.

Why has this happened? The basic optimizer model assumes that your data is evenly spread between the low and high values, so Oracle’s estimate (ignoring a couple of details round the edges) is:

1	“rows in table with non-null value” * “range you want” / “high value – low value”.

In my case this is (approximately);

1	182700 * 365 / 727928 + 200

Oracle thinks I want one year out of nearly 2,000 rather than one year out of five (with two corrections of 100 rows, which are to allow for end the –points on a bounded range).

At this point I could simply write a little piece of PL/SQL that called dbms_stats.set_column_stats() to change the high value for the column to something much closer to the “proper” high value. If the volume of outlying data is tiny this would be a reasonable strategy, you just have to remember that a query for the “pseudo-null” would then be so far outside the known range that the optimizer would probably give a cardinality estimate of 1 for that value, no matter how much data really existed.

Rather than hacking the statistics, though, I’m going to create a histogram on the column. Typically I would aim for 254 columns (the maximum in 11g), but as a demonstration of how effective they can be I’m actually going to show you the effect of creating a histogram with just 11 columns (2 per year, plus one spare) – this is how the plan changes:

----------------------------------------------------------

----------------------------------------------------------

| 0 | SELECT STATEMENT | | 36420 | 284K| 52 |

|* 1 | TABLE ACCESS FULL| T1 | 36420 | 284K| 52 |

----------------------------------------------------------

Predicate Information (identified by operation id):

---------------------------------------------------

1 - filter("DATE_CLOSED">=TO_DATE(' 2010-01-01 00:00:00',

'syyyy-mm-dd hh24:mi:ss') AND "DATE_CLOSED"<=TO_DATE(' 2010-12-31

00:00:00', 'syyyy-mm-dd hh24:mi:ss'))

At 36,420 the optimizer’s estimate is just about perfect. When sampling the data, Oracle found that one eleventh of the data in the range from (roughly) July 2012 to December 4000, and found 10/11ths of the data spread evenly between Jan 2008 and July 2012, so the histogram gave it the information it needed to “understand” what the data really looked like in the date range I wanted (it wouldn’t be quite so accurate for dates after 20^th July – which is why I’d really want to use 254 buckets).

Note: most of the articles you see about histograms describe how they help the optimizer to deal with “large spikes” or “popular values” in the data. Histograms can help with ANY pattern in the data that is not boring, flat, unbroken and uniform. In this case the histogram told the optimizer about a big gap in the spread of the data.

Here’s the content of the view user_tab_histograms for this data set (for dates, the stored value is the Julian version of the data):

ENDPOINT_NUMBER ENDPOINT_VALUE EPV_CONVERTED

--------------- -------------- -------------

0 2454467 01-Jan-2008

1 2454633 15-Jun-2008

2 2454799 28-Nov-2008

3 2454965 13-May-2009

4 2455132 27-Oct-2009

5 2455298 11-Apr-2010

6 2455464 24-Sep-2010

7 2455630 09-Mar-2011

8 2455797 23-Aug-2011

9 2455963 05-Feb-2012

10 2456129 20-Jul-2012

11 3182395 31-Dec-4000

We do not use dates

Some applications store dates either as 8 character strings in the format “YYYYMMDD”, or the numeric equivalent. This, conveniently, allows the dates to be human readable while keeping the correct sort order. Here’s a sample data set – again running from Jan 2008 to Dec 2012, but this time only one row per day, and no “pseudo-nulls”

create table t1

select

d1,

to_number(to_char(d1,'yyyymmdd')) n1,

to_char(d1,'yyyymmdd') v1

from (

select

to_date('31-Dec-2007') + rownum d1

from all_objects

where

rownum <= 1827

)

;

I’m going to write something close to the the nastiest possible query I can to make a point – and I’m going to test it first without a histogram on any of the columns, then with a histogram in place. The query is simply going to ask: “how many days are there between 30^th Dec 2011 and 5^th Jan 2012 – the well-informed DBA will work out that the answer is 7 (or possibly 5, or 6, depending on how they interpret “between” and dates). What will the optimizer think?

Here’s the answer when I query the date column:

----------------------------------------------------------

----------------------------------------------------------

| 0 | SELECT STATEMENT | | 8 | 184 | 3 |

|* 1 | TABLE ACCESS FULL| T1 | 8 | 184 | 3 |

----------------------------------------------------------

Predicate Information (identified by operation id):

---------------------------------------------------

1 - filter("D1">=TO_DATE(' 2011-12-30 00:00:00', 'syyyy-mm-dd

hh24:mi:ss') AND "D1"<=TO_DATE(' 2012-01-05 00:00:00', 'syyyy-mm-dd

hh24:mi:ss'))

It’s pretty close – the error is due to the detail of the optimizer’s handling of ranges where it adds (1/num_distinct) to the selectivity to allow for the end points for bounded ranges.

Then here’s the equivalent (unless you’re running a very old version of Oracle) for the varchar2() column:

----------------------------------------------------------

----------------------------------------------------------

| 0 | SELECT STATEMENT | | 396 | 9108 | 3 |

|* 1 | TABLE ACCESS FULL| T1 | 396 | 9108 | 3 |

----------------------------------------------------------

Predicate Information (identified by operation id):

---------------------------------------------------

1 - filter("V1">='20111230' AND "V1"<='20120105')

The answer is out by a factor of 79 – not really acceptable, and likely to produce some bad execution plans. If you try the same query on the numeric column, by the way, you’ll get the same result – I think there’s a piece of “special case” optimizer code that uses the arithmetic for numeric data when handling character data the looks like numeric data (i.e. when the low, high, and predicate values all look like numbers).

Where does the error come from ? Oracle can do date arithmetic, so it can calculate the ranges we need accurately: (5^th Jan 2012 – 30^th Dec 2011 = 6 … and that’s the figure that puts the calculation into the right ballpark. On the other hand, if you look at the 8-digit numeric column: 20120105 – 20111230 = 8,875 you may begin to suspect that some strange cardinalities are likely to appear.

If you think about the data as a series of number, you get (roughly) 30 consecutive numbers, then a gap of about 70 as you jump to the next month, and this repeats 11 time, then there’s a gap of about 9,000 as you jump to the next year. The optimizer’s cardinality predictions are going to be all over the place if it thinks your data is spread evenly between 20080131 and 20121231.

So create a histogram – I picked 120 buckets this time, we really need quite a lot of buckets to capture the detail around every single month (and on a production system I’d use 254) – and let’s see the effect:

----------------------------------------------------------

----------------------------------------------------------

| 0 | SELECT STATEMENT | | 15 | 345 | 3 |

|* 1 | TABLE ACCESS FULL| T1 | 15 | 345 | 3 |

----------------------------------------------------------

Predicate Information (identified by operation id):

---------------------------------------------------

1 - filter("V1">='20111230' AND "V1"<='20120105')

The results are a lot better – not perfect, but possible sufficiently accurate that you get the plan you need automatically. Again, you might note, we haven’t used a histogram to highlight “spikes” in the data – every single value appears the same number of times – we’ve used the histogram to show the optimizer the gaps in the data as you go from low to high.

It’s worth making one more point about this example – the ideal solution to this cardinality problem is to create a virtual column (perhaps through a function-based index in earlier versions of Oracle) that converts the stored data to the correct data type e.g.

alter table t1

add (

d1_virtual generated always as (to_date(v1,'yyyymmdd')) virtual

)

;

begin

dbms_stats.gather_table_stats(

user,

't1',

method_opt=>'for columns d1_virtual size 1'

);

end;

Provided you can modify the SQL to reference the virtual column (or match the function-based index) you’ve got a column of the right data type, with the right statistics to give the optimizer the best chance possible to calculate the correct cardinality.

Conclusion

In the absence of histograms the optimizer assumes that your data is evenly spread from the low value to the high value, with no gaps, and no spikes; if you could draw a graph of your data and see a continuous flat line then that’s the image the optimizer has of your data. If a graph of your data shows big gaps, or a pattern that swings a long way from a flat line, then a histogram may help although, as we saw in part 2, they can be very sensitive to small changes and introduce instability to your execution paths.

There are a couple of scenarios where histograms can be used to address poor application design – using extreme values instead of NULLs and storing data in the wrong data type (most significantly storing dates in character or numeric columns). In these cases the ideal strategy is to correct the design flaw, but it’s often not possible to change the database structure or application so a histogram may help the optimizer deal with the side effects of the design decision.

Histograms Part 3 – When?

Database-agnostic applications

We do not use NULL

We do not use dates

Conclusion

Subscribe for more articles

Rate this article

Jonathan Lewis

Related articles

A Guide to INSERT, UPDATE, and DELETE Statements in Oracle