Histograms Part 1 – Why?

In this short series on histograms we will be looking at the reasons why we need histograms and the methods Oracle has for creating them. We will examine the costs of creating them and the undesirable overheads they can introduce, and then we will review their potential for giving you stability problems in your execution plans. This overview will be limited to versions of Oracle prior to 12c, which introduced new types of histogram and new collection methods aimed at reducing the overheads and improving stability.

A simple example

Some time ago a client of mine saw some strange performance problems in what seemed to be a very simple part of their application. They were dealing with an online sales system, and during the course of the day they needed to run off regular reports of the form: “show me orders that have recently been placed but have not yet been dispatched”. This requirement turned into a very simple piece of SQL:
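Something along these lines (a sketch only – the table name orders, the column list, and the use of ‘R’ for “recently placed, not yet dispatched” are illustrative assumptions):

        select  ord_id, customer_id, date_placed
        from    orders
        where   status = 'R'            -- recently placed, not yet dispatched
        order by
                date_placed
        ;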

At any one instant during the day there would be only a small number of orders matching this predicate – one or two hundred out of a table holding millions of rows. There was an index on the status column to allow for an efficient access path, and since new orders would, for simple mechanical reasons, be in the most recently added blocks of the table – blocks that would be cached in memory – it was assumed that the query would inevitably execute very quickly. The problem was that on some days the report would take tens of seconds to run rather than being virtually instantaneous.

The first diagnostic, of course, was to check that the execution plan was the one expected – was the query trying to do something efficient? – and there was no need for a second diagnostic. When the query ran quickly Oracle was using the expected index; when it ran slowly Oracle was doing a tablescan. So the question changed from “why is the query slow?” to “why does the optimizer think, some of the time, that a tablescan is a good idea?”

Reading the description of the business activity, and given the title of the article, you’ve probably already got a very good idea. This data set is very skewed and when the optimizer “sees” the skew we get the right plan and when it doesn’t see the skew we get the wrong plan.

Here’s a query (against a model of the data) highlighting the type of problem:
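Something like this simple aggregate (a sketch against the assumed orders table):

        select  status, count(*)
        from    orders
        group by
                status
        order by
                status
        ;

As described below, the output shows almost all of the 1,000,000+ rows falling under just two values, with only a handful of rows spread across a few other status codes.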

As you can see, most of the data has ended up in one of two states (depending on how the order finally got to the customer) and a tiny volume of data is scattered across a handful of other values. When you see data like this and know that you have a requirement to access the “rare” or “non-popular” values (the latter is Oracle’s preferred terminology) your thoughts are likely to turn in one of two directions – virtual columns (which might mean a function-based index, or the 11g implementation of virtual columns, or even the 11g “extended stats”) or histograms.

Virtual Columns etc.

To my mind the nicest solution comes from virtual columns (or function-based indexes pre-11g) because this allows us to maintain a very small, precisely targeted, index despite the size of the data set. So we might create something like the following:
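A function-based index along these lines, perhaps (a sketch, again using the assumed orders table and the rare value ‘R’):

        create index ord_new on orders(
                case when status = 'R' then 'R' else null end
        );

Rows with any other status value generate a null for the indexed expression, and entirely null entries are not stored in a B-tree index, so the index stays tiny no matter how large the table grows.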

I will need to collect statistics on the hidden column holding the index definition, but collecting stats on all hidden columns after creating the index could be an expensive way of doing it, so I could check the view user_tab_cols for the most recent column name, which will be something like sys_nc00037$, and collect stats on just that specific column. (Note: technically the “else null” is redundant – but I prefer to include the final option explicitly.)
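A call along the following lines would do it (a sketch – the hidden column name is whatever user_tab_cols reports on your system; size 1 is enough because the expression generates only one non-null value):

        begin
                dbms_stats.gather_table_stats(
                        ownname    => user,
                        tabname    => 'ORDERS',
                        method_opt => 'for columns sys_nc00037$ size 1'
                );
        end;
        /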

Of course, I might want to run similar queries for the other non-popular values so I could create two more indexes like the above, or I could create an index that covers the three values – here’s an example using the 11g virtual column approach:
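With the 11g approach it might look something like this (a sketch – the column name rare_status and the choice of the three rare values are illustrative assumptions):

        alter table orders add (
                rare_status generated always as (
                        case when status in ('R','P','S') then status end
                ) virtual
        );

        create index ord_rare_status on orders(rare_status);

As before, statistics would need to be gathered on the new column.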

There is a limitation to the virtual column / function-based index approach (whichever method you use) – you have to change the application code to take advantage of it. The “proper” virtual columns in 11g make the code look neater than the FBI code, but there still has to be a change, e.g. (for the FBI example that I’ve given):
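Something like this (a sketch): the where clause has to match the indexed expression exactly for the optimizer to consider the function-based index.

        select  ord_id, customer_id, date_placed
        from    orders
        where   case when status = 'R' then 'R' else null end = 'R'
        ;

With the 11g virtual-column version the predicate would simply be rare_status = 'R' (using the assumed column name from the earlier sketch).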

Histograms

So what do we do if we can’t change the code? We have to make sure that the optimizer is aware of the problem because if we don’t then the basic optimizer model will produce a bad estimate for the cardinality (row count) and choose a bad execution path. At the simplest level the statistics we collect for the optimizer will say: “there are 1,030,000 rows in the table, this column has 5 distinct values and no nulls and the values are distributed evenly from ‘C’ to ‘X’”. With this information the optimizer’s cardinality estimate for the predicate “status = ‘C’” will be derived as: total rows / number of distinct values = 206,000. This assumes, of course, that you used a 100% sample (estimate_percent => 100) to gather stats; the results could be a little less predictable if you’re using a version older than 11g or haven’t yet converted to the “approximate NDV” mechanism in 11g.
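To see the figures the optimizer is working from, and to reproduce its arithmetic, you could run something like the following (a sketch against the assumed orders table):

        select
                t.num_rows,
                c.num_distinct,
                c.num_nulls,
                round((t.num_rows - c.num_nulls) / c.num_distinct)   cardinality_estimate
        from
                user_tables    t,
                user_tab_cols  c
        where
                t.table_name  = 'ORDERS'
        and     c.table_name  = t.table_name
        and     c.column_name = 'STATUS'
        ;

        -- with 1,030,000 rows, no nulls, and 5 distinct values this reports 206,000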

This is where histograms come into play – they allow us to give the optimizer more detailed information about the distribution of values in a column. Until 12c they come in two flavours: frequency histograms and height-balanced histograms – and in our example we need a frequency histogram. (Note: 12c has two new types of histogram: Top-N and hybrid).

In principle a frequency histogram is an exact picture of the data at a moment in time (and that last phrase is very important) whereas a height-balanced histogram is an approximate image of the data distribution that tries to capture details of popular values and the uneven spread of the rest. A frequency histogram can be created when a column holds no more than 254 distinct values (2,048 in 12c), whereas a height-balanced histogram is much less precise and can’t really capture information about more than 127 popular values. For the rest of this article I’ll restrict myself to frequency histograms.

Frequency Histogram

In our example we have only five distinct values and the model dataset holds just over 1M rows. I can ask Oracle to collect a histogram on the column by gathering table stats with the following setting for the method_opt parameter: “for columns status size 254”. (Note: although I know there are only 5 values I might as well ask for the maximum; Oracle will discover that 5 is sufficient.) If I also set the estimate_percent to 100 I will end up with the following in the user_tab_histograms view for this column:
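The gathering call, and the query I use to report the histogram, look like this (a sketch – the character decode picks out just the first byte of the endpoint value, which is enough for these single-character status codes; the general transformation is covered by the link below):

        begin
                dbms_stats.gather_table_stats(
                        ownname          => user,
                        tabname          => 'ORDERS',
                        method_opt       => 'for columns status size 254',
                        estimate_percent => 100
                );
        end;
        /

        select
                chr(to_number(
                        substr(to_char(endpoint_value,'fmxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'),1,2),
                        'xx'
                ))                                                                      status_value,
                to_char(endpoint_value,'fmxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx')          hex_value,
                endpoint_number,
                endpoint_number
                        - lag(endpoint_number,1,0) over (order by endpoint_number)      frequency
        from
                user_tab_histograms
        where
                table_name  = 'ORDERS'
        and     column_name = 'STATUS'
        order by
                endpoint_number
        ;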

I’ve shown a method of converting the endpoint_value as it’s stored for character columns from its internal numeric form to its character equivalent. If you have ASCII codes at your fingertips you’ll spot lots of spaces (0x20) appended to the hex representation of the endpoint value – for details of how the transformation is actually performed see: http://jonathanlewis.wordpress.com/2010/10/05/frequency-histogram-4/

Note that the histogram is effectively stored as a cumulative frequency, which I’ve unwound by using the lag() analytic function; this lets you see that Oracle holds exact counts for each of the distinct values in the data.

With this information in place – and assuming that my SQL really does use literal values – when the optimizer calculates the cardinality from the predicate “status = ‘R’” it can check that this is a value in the histogram and report the count it finds recorded there. The frequency histogram is an enormous aid in this particular case – so, you may ask, why don’t we simply create histograms for every single column in the application (or possibly just the columns that appear in where clauses)?

Threats with Frequency Histograms

There are four main threats with histograms, which I can summarise with the following bullet points and will then examine in order:

  • They don’t mix nicely with bind variables
  • They’re expensive to compute
  • They can be very unstable when sampled
  • You have to collect them at the right moment

I made the comment about the optimizer being able to pick the correct entry from the histogram if your query uses the literal value. If you’ve used a bind variable for this query then the optimizer will use “bind-peeking” on the first parse call and still produce the correct cardinality (and execution plan); but until the advent of “adaptive cursor sharing” in 11g, and “adaptive execution plans” in 12c, that one plan was (essentially) the plan you kept for all subsequent executions of the query – no matter how the value of the bind variable changed. In my example, matching the client’s system, a bind variable would have been okay because the ONLY queries against this table were very simple ones for “status = {rare value}”, and the plan for status ‘R’ would have been fine for ‘P’ and ‘S’ – but generally you won’t be that lucky. If you’ve created a histogram on a column then you should expect to do something in the application that lets the optimizer handle the histogram well – and that may mean using literals in the where clause (which can have its own problems), or it may mean doing something more subtle, such as writing application code to check the users’ requests and pick the most appropriate SQL from a short list to run in each case.

Assuming you’ve worked out how to make best use of a frequency histogram in your code, you still have the problem of making sure that the histogram is accurate when the optimizer wants to read it. Here’s the SQL Oracle ran when I told it to collect the frequency histogram for the above data with a 100% sample size:
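Stripped of its long list of hints and helper expressions, the statement boils down to an aggregate of this shape (a paraphrase for illustration; the literal recursive SQL varies with version):

        select  status  val, count(*)  cnt        -- the real statement also carries a long list of hints
        from    orders  t
        group by
                status
        ;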

The exact query will depend on the version of Oracle and whether Oracle thinks the column needs a frequency histogram or a height-balanced histogram, but the general principle is that you’re going to see an aggregate query that’s going to crunch a large volume of data – and a variant of the query will appear for each column that you identify as the target for a histogram. Gathering histograms (accurately, at least) is an expensive operation.

You could reduce the cost of gathering the histogram by sampling, rather than computing. You will see similar SQL appearing when you do this, though there are variations – in particular Oracle will often copy a sample of the original rows into a global temporary table that it creates for the purpose and then run the queries against the global temporary table. This can result in a lot less work being done to build the histogram – but it introduces a different threat. Here’s the content of the histogram on my original data when I gave Oracle the “auto_sample_size” option to collect a histogram:
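The collection itself was driven by a call like this (a sketch), with the results reported using the same user_tab_histograms query as before:

        begin
                dbms_stats.gather_table_stats(
                        ownname          => user,
                        tabname          => 'ORDERS',
                        method_opt       => 'for columns status size 254',
                        estimate_percent => dbms_stats.auto_sample_size
                );
        end;
        /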

If you add up the figures you’ll see that Oracle has sampled 5,498 rows from the table – so when it estimates the number of rows for any given value it will check the histogram and multiply by 1,030,000 / 5,498 (the numerator is the number of rows according to user_tables.num_rows minus user_tab_cols.num_nulls), so the estimate for status = ‘S’ will be 187, and for status = ‘P’ it will be 375 – both of which are fairly reasonable (especially when compared to the 1,030,000/5 that we would see in the absence of a histogram).
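Working backwards from the figures quoted (the per-value sample counts of 1 and 2 are inferred from those estimates rather than copied from the output), the arithmetic runs as follows:

        scaling factor  =  (num_rows - num_nulls) / sample size  =  1,030,000 / 5,498  ≈  187.3
        status = 'S'    :  1 row  in the sample  ×  187.3  ≈  187
        status = 'P'    :  2 rows in the sample  ×  187.3  ≈  375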

But what do we do about status ‘R’? – it hasn’t appeared in the sample so it hasn’t appeared in the histogram. In this case the optimizer will simply halve the cardinality of the least popular value that it sees in the histogram – so the cardinality would be calculated as 94. Again, in this case, that’s not too bad and it won’t change the critical execution plan, but you can appreciate that if you get a little unlucky in the rows that Oracle samples from day to day your execution plans could change fairly randomly. Can you spot the major threat in this particular set of data?

What happens if Oracle doesn’t spot ANY of the rare values when sampling the data and ends up with a histogram that says the data is split roughly 50/50 between C and X at about 500,000 rows each? A query for status = ‘R’ will use “half the least popular value” – giving an estimate of about 250,000; and that’s exactly what happened to my client. From time to time the stats gathering routine (which was doing the default 10g overnight collection of stale statistics) would gather stats on this table and miss all the rare values, and for the next 24 hours (or until the next stats collection) the optimizer would decide to use a tablescan on a very large table rather than using a highly appropriate index.

The idea of failing to capture critical information in the histogram leads us to the last critical problem of histograms – what if the critical information is never there when you collect the stats? Imagine that the rare values in my order processing system appear only between 6:00 a.m. and 6:00 p.m. and by 10:00 p.m. they have all been processed out of the system. When the default stats collection runs some time late at night the only values in the table are ‘C’ and ‘X’, but when the queries run in the daytime the only values we’re interested in are exactly the ones that weren’t there when the stats were collected. Even with a 100% sample there may be parts of your system that get misleading stats if you gather them at the wrong time. You need to know your system well enough to know when the application code itself should be responsible for the quality of the statistics. This may mean that you write code to gather statistics at a specific time of day; it may mean that you write code that manipulates the stored statistics directly (and we will be looking at that strategy when we look at height-balanced histograms).
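As an indication of the first option, here’s a sketch of the sort of call the application (or a scheduled job) might issue mid-morning, once the rare values exist; locking the table stats stops the default overnight job from undoing the work, and the force parameter lets this deliberate call through the lock:

        begin
                dbms_stats.lock_table_stats(user, 'ORDERS');    -- one-off step: protect the stats from the overnight job

                dbms_stats.gather_table_stats(
                        ownname          => user,
                        tabname          => 'ORDERS',
                        method_opt       => 'for columns status size 254',
                        estimate_percent => 100,
                        force            => true                -- override the lock for this deliberate collection
                );
        end;
        /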

Conclusion

When the distribution of data values is highly skewed you need to do something about it if you want to ensure that the optimizer doesn’t produce very bad execution plans as a consequence. If you have control of the application code, features like virtual columns or function-based indexes may help you deal with special values; but if you can’t change the code you may need to depend on histograms. Even with histograms in place, though, bind variables can easily cause problems – even with the newer features in 11g and 12c for adaptive cursors and adaptive execution plans.

A brief examination of frequency histograms (the simpler type) showed us how useful they can be for a column with a small number of distinct values – especially if your SQL uses literals. Histograms, even the simple frequency histograms, can be expensive for Oracle to create unless it samples a small fraction of the data, and then it’s possible that the histogram will introduce instability if the really interesting data is transient and a very small fraction of the total. In fact, even if you use a 100% sample at the wrong time the resulting histogram can still cause problems thanks to the optimizer’s treatment of missing values.