Histograms Part 3 – When?

In part 2 we looked at the way that Oracle collects and uses height-balanced histograms, and raised the problem of how sensitive they could be to small changes in the sampling, or in the data. We ended the article with a suggestion on how you could generate data for an “approximate” frequency histogram that could be “good enough” to help the optimizer do the right thing.

In part 3, then, we’re going to look at a couple of cases where allowing Oracle to do its own thing could well be a sensible, and adequate, production strategy. As a general guideline, I’ve only found height-balanced histograms to be reasonably safe when you use them to “take out” a very small number of high-frequency values – in effect allowing the optimizer to optimize for the remaining “interesting” data. There are, however, a couple of design errors where you can use histograms as a damage-limitation mechanism, and those are the cases I’ll be looking at in this article.

I had been planning to make some comments about how to incorporate histogram collection into a complete stats collection strategy – but this article became longer than I expected, so I’m going to postpone my thoughts on the actual stats collection implementation to another day.

Database-agnostic applications

There are applications in the wild that have been written to run against “any” database, and they try to minimize their dependencies on the features of any given database. As a consequence, “runs on any database” tends to mean “runs badly on any database”.

From a pure data perspective, two of the most common aspects of agnosticism are:

  • We do not use NULL – we use a special (extreme) value.
  • We do not use date datatypes – we store dates as character strings or numbers.

The side-effects on Oracle’s optimizer can be catastrophic.

We do not use NULL

Here’s a script to create a simple data set to make the point. The data represents 5 years of data (starting on 1st Jan 2008) at 100 rows per day, with every 1000th row set to the “pseudo-null” value of 31st Dec 4000. After creating the data I gathered statistics without generating any histograms, i.e. using method_opt => ‘for all columns size 1’.
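A sketch of the kind of script involved follows – the table and column names (t1 and d1) are purely illustrative choices:

create table t1
as
with generator as (
        select  rownum id
        from    dual
        connect by
                level <= 5 * 365 * 100
)
select
        case
                when mod(id,1000) = 0
                then    to_date('31-Dec-4000','dd-Mon-yyyy')   -- the "pseudo-null"
                else    to_date('01-Jan-2008','dd-Mon-yyyy') + trunc((id - 1)/100)
        end     d1
from
        generator
;

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          => 't1',
                method_opt       => 'for all columns size 1'
        );
end;
/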

I’m not going to worry about indexes or details of optimum access paths; all I’m interested in at present is the optimizer’s cardinality calculation, and how close it gets to the right answer. Based on the SQL you can see that a query for “all the data for 2010” should return 36,500 rows – so let’s write a query that asks for that data, and check the resulting execution plan for its row prediction:
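Using the illustrative names from the sketch above, the test might look like this:

explain plan for
select
        count(*)
from
        t1
where
        d1 between to_date('01-Jan-2010','dd-Mon-yyyy')
               and to_date('31-Dec-2010','dd-Mon-yyyy')
;

select * from table(dbms_xplan.display);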

At 291 rows predicted, the optimizer seems to be out by a factor of 125; in most circumstances an error this large is likely to produce a bad choice of execution plan.

Why has this happened? The basic optimizer model assumes that your data is evenly spread between the low and high values, so Oracle’s estimate (ignoring a couple of details round the edges) is:
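Using num_rows and num_distinct for the relevant column statistics, something like:

        cardinality = num_rows * (
                (size of requested range) / (column high value - column low value)
              + 1/num_distinct + 1/num_distinct
        )

where the two 1/num_distinct terms are the corrections for the two closed ends of the range.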

In my case this is (approximately):
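Taking roughly 182,500 rows, a low/high range of about 728,000 days, a requested range of 365 days, and about 1,826 distinct dates (which gives roughly 100 rows per distinct value):

        182,500 * ( 365 / 728,000 )     =  approx.  91 rows
      + 182,500 / 1,826                 =  approx. 100 rows
      + 182,500 / 1,826                 =  approx. 100 rows
                                 total  =  approx. 291 rows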

Oracle thinks I want one year out of nearly 2,000 rather than one year out of five (with two corrections of 100 rows, which allow for the end-points of a bounded range).

At this point I could simply write a little piece of PL/SQL that called dbms_stats.set_column_stats() to change the high value for the column to something much closer to the “proper” high value. If the volume of outlying data is tiny this would be a reasonable strategy; you just have to remember that a query for the “pseudo-null” would then be so far outside the known range that the optimizer would probably give a cardinality estimate of 1 for that value, no matter how much data really existed.
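A minimal sketch of that approach (again using the illustrative t1/d1 names, and reading the current statistics first so that only the low/high values get changed) might be:

declare
        m_distcnt       number;
        m_density       number;
        m_nullcnt       number;
        m_avgclen       number;
        m_srec          dbms_stats.statrec;
        m_datevals      dbms_stats.datearray;
begin
        -- pick up the current column statistics
        dbms_stats.get_column_stats(
                ownname => user, tabname => 't1', colname => 'd1',
                distcnt => m_distcnt, density => m_density,
                nullcnt => m_nullcnt, srec    => m_srec,
                avgclen => m_avgclen
        );

        -- supply a new low/high pair for the column
        m_datevals := dbms_stats.datearray(
                to_date('01-Jan-2008','dd-Mon-yyyy'),
                to_date('31-Dec-2012','dd-Mon-yyyy')
        );
        m_srec.epc    := 2;
        m_srec.bkvals := null;
        dbms_stats.prepare_column_values(m_srec, m_datevals);

        -- write the statistics back with the modified low/high values
        dbms_stats.set_column_stats(
                ownname => user, tabname => 't1', colname => 'd1',
                distcnt => m_distcnt, density => m_density,
                nullcnt => m_nullcnt, srec    => m_srec,
                avgclen => m_avgclen
        );
end;
/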

Rather than hacking the statistics, though, I’m going to create a histogram on the column. Typically I would aim for 254 buckets (the maximum in 11g), but as a demonstration of how effective they can be I’m actually going to show you the effect of creating a histogram with just 11 buckets (2 per year, plus one spare) and checking the plan again.
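The histogram can be gathered with a call along these lines (names as before):

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          => 't1',
                method_opt       => 'for columns d1 size 11'
        );
end;
/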

At 36,420 the optimizer’s estimate is just about perfect. When sampling the data, Oracle found that one eleventh of the data was in the range from (roughly) July 2012 to December 4000, and that the other 10/11ths of the data was spread evenly between Jan 2008 and July 2012, so the histogram gave it the information it needed to “understand” what the data really looked like in the date range I wanted (it wouldn’t be quite so accurate for dates after 20th July 2012 – which is why I’d really want to use 254 buckets).

Note: most of the articles you see about histograms describe how they help the optimizer to deal with “large spikes” or “popular values” in the data. Histograms can help with ANY pattern in the data that is not boring, flat, unbroken and uniform. In this case the histogram told the optimizer about a big gap in the spread of the data.

Here’s how to see the content of the view user_tab_histograms for this data set (for dates, the stored value is the Julian version of the date):
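A query such as the following (still using the illustrative table and column names) lists the bucket end-points:

select
        endpoint_number, endpoint_value
from
        user_tab_histograms
where
        table_name  = 'T1'
and     column_name = 'D1'
order by
        endpoint_number
;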

We do not use dates

Some applications store dates either as 8-character strings in the format “YYYYMMDD”, or as the numeric equivalent. This, conveniently, allows the dates to be human-readable while keeping the correct sort order. Here’s a sample data set – again running from Jan 2008 to Dec 2012, but this time with only one row per day, and no “pseudo-nulls”.
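A sketch of a suitable script – the names t2, d1_date, d1_char and d1_num are again purely illustrative – might be:

create table t2
as
select
        d1_date,
        to_char(d1_date,'yyyymmdd')                     d1_char,
        to_number(to_char(d1_date,'yyyymmdd'))          d1_num
from    (
        select
                to_date('31-Dec-2007','dd-Mon-yyyy') + rownum   d1_date
        from
                dual
        connect by
                level <= 1827
        )
;

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          => 't2',
                method_opt       => 'for all columns size 1'
        );
end;
/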

I’m going to write something close to the nastiest possible query I can to make a point – and I’m going to test it first without a histogram on any of the columns, then with a histogram in place. The query is simply going to ask: “how many days are there between 30th Dec 2011 and 5th Jan 2012?” – the well-informed DBA will work out that the answer is 7 (or possibly 5, or 6, depending on how they interpret “between” and dates). What will the optimizer think?
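For reference, the three variants of the query would look something like this (using the illustrative column names from the sketch above):

select  count(*) from t2
where   d1_date between to_date('30-Dec-2011','dd-Mon-yyyy')
                    and to_date('05-Jan-2012','dd-Mon-yyyy');

select  count(*) from t2
where   d1_char between '20111230' and '20120105';

select  count(*) from t2
where   d1_num  between 20111230   and 20120105;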

Here’s the answer when I query the date column:

It’s pretty close – the error is due to a detail of the optimizer’s handling of ranges, where it adds 1/num_distinct to the selectivity for each closed end of a bounded range.

Then here’s the equivalent (unless you’re running a very old version of Oracle) for the varchar2() column:

The answer is out by a factor of 79 – not really acceptable, and likely to produce some bad execution plans. If you try the same query on the numeric column, by the way, you’ll get the same result – I think there’s a piece of “special case” optimizer code that uses the arithmetic for numeric data when handling character data that looks like numeric data (i.e. when the low, high, and predicate values all look like numbers).

Where does the error come from? Oracle can do date arithmetic, so it can calculate the ranges we need accurately: 5th Jan 2012 – 30th Dec 2011 = 6, and that’s the figure that puts the calculation into the right ballpark. On the other hand, if you look at the 8-digit numeric column: 20120105 – 20111230 = 8,875 – and you may begin to suspect that some strange cardinalities are likely to appear.

If you think about the data as a series of numbers, you get (roughly) 30 consecutive numbers, then a gap of about 70 as you jump to the next month; this repeats 11 times, then there’s a gap of about 9,000 as you jump to the next year. The optimizer’s cardinality predictions are going to be all over the place if it thinks your data is spread evenly between 20080101 and 20121231.

So create a histogram – I picked 120 buckets this time, since we really need quite a lot of buckets to capture the detail around every single month (and on a production system I’d use 254) – and let’s see the effect:
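Again, the call would be something along these lines – here gathering the histogram on the character column; the numeric column could be treated the same way:

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          => 't2',
                method_opt       => 'for columns d1_char size 120'
        );
end;
/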

The results are a lot better – not perfect, but probably accurate enough that you get the plan you need automatically. Again, you might note, we haven’t used a histogram to highlight “spikes” in the data – every single value appears the same number of times – we’ve used the histogram to show the optimizer the gaps in the data as you go from low to high.

It’s worth making one more point about this example – the ideal solution to this cardinality problem is to create a virtual column (or perhaps a function-based index in earlier versions of Oracle) that converts the stored data to the correct data type, e.g.
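A sketch of the virtual-column approach (11g syntax, still using the illustrative names) might be:

alter table t2 add (
        d1_as_date      generated always as (
                to_date(d1_char,'yyyymmdd')
        ) virtual
);

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          => 't2',
                method_opt       => 'for columns d1_as_date size 1'
        );
end;
/

select  count(*) from t2
where   d1_as_date between to_date('30-Dec-2011','dd-Mon-yyyy')
                       and to_date('05-Jan-2012','dd-Mon-yyyy');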

Provided you can modify the SQL to reference the virtual column (or match the function-based index) you’ve got a column of the right data type, with the right statistics to give the optimizer the best chance possible to calculate the correct cardinality.

Conclusion

In the absence of histograms the optimizer assumes that your data is evenly spread from the low value to the high value, with no gaps and no spikes; if you could draw a graph of your data and see a continuous flat line then that’s the image the optimizer has of your data. If a graph of your data shows big gaps, or a pattern that swings a long way from a flat line, then a histogram may help – although, as we saw in part 2, histograms can be very sensitive to small changes and can introduce instability into your execution paths.

There are a couple of scenarios where histograms can be used to address poor application design – using extreme values instead of NULLs, and storing data in the wrong data type (most significantly storing dates in character or numeric columns). In these cases the ideal strategy is to correct the design flaw, but it’s often not possible to change the database structure or the application code, so a histogram may help the optimizer deal with the side effects of the design decision.