Basics of the Cost Based Optimizer – Part 5

It’s been three months since I last published anything in this series on the Cost Based Optimizer, so a thumbnail sketch of the previous installments seems to be in order:

  • In part 1 I gave an informal overview of how the optimizer would “think” about a simple query.
  • In part 2 I created a data set, and ran a few single-table queries that displayed data patterns, highlighted one of the problems the optimizer has estimating the resources needed to handle a query, and supplied an official correction to a common optimizer problem.
  • In part 3 I showed how the optimizer’s basic algorithms for hash joins and nested loops can easily give the better plan a higher cost than the worse plan thanks to the optimizer’s ignorance of caching effects.
  • In part 4 I introduced the “unit of optimisation” – a simple join between tables – and explained how the optimizer will attempt to transform a complex query into something simpler – typically with fewer query blocks – before optimizing query blocks and estimating the effects of combining those query blocks. The example of the “filter subquery” showed two ways in which the optimizer’s estimates could go wrong.

In this installment I’ll be looking at other ways in which the optimizer can choose poor execution plans and giving some pointers on how to spot the source of the error quickly, and a couple of strategies for addressing some of these errors.

Sources of errors

We’ve already seen that the basic cost of a nested loop join is “number of rows selected from first table” * “cost of selecting one related set of rows from second table”. We’ve noted that for an index the clustering_factor has a great impact on the cost of selecting the nominal “one related set”, and that 12c (with a backport to version 11.2.0.4) has given us a mechanism to tell Oracle how to derive a more realistic value for the clustering_factor. Given the multiplication, though, we also need to know a little more about how Oracle can produce a bad estimate for the “number of rows selected” and what we can do about helping the optimizer to get a better estimate; so we need to start thinking about selectivity.

Cardinality and Selectivity

Although, as users of data, we tend to think in terms of “number of rows” (the cardinality), most of the critical arithmetic the optimizer uses is about the “fraction of rows” (the selectivity). If we execute a query like “select * from all_objects where object_type = ‘SYNONYM’” we might see a plan where the “Rows” (cardinality) column predicts 2,365, but under the covers the optimizer has been working with the fraction (selectivity) 1/30, and has applied that fraction to its estimate of 70,963 rows being in the table.

The big problem with selectivity is that Oracle often doesn’t have a well-informed mechanism for working out the selectivity it should apply at each step of an execution plan. When this happens it has two options: it can simply guess, using a fairly arbitrary fraction, or it can apply a rational statistical approach that isn’t necessarily appropriate for your data.

Here’s a little script to generate some data that we can use to demonstrate a few of the problems. It’s worth running the examples I’m going to show (and trying variations on the theme) to get a better feeling for how easy it is to spot the underlying cause of a bad cardinality estimate once you’ve had a little practice. I’ve used 11.2.0.4 in all the following cases.
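The original script isn’t reproduced here, but a sketch along the following lines would produce data matching the description below (the table name t1 and the name sqrt_1000 for the final skewed column are my assumptions; the other column names come from the text):

rem
rem     Hypothetical reconstruction of the data set (table name t1 assumed)
rem

create table t1
nologging
as
with generator as (
        select rownum id
        from   dual
        connect by level <= 1e4
)
select
        mod(rownum,200)                                 mod_200,        -- 200 values, 5,000 rows each
        mod(rownum,200) + trunc(dbms_random.value(0,5)) mod_200_fudge,  -- roughly 1,000 rows at each of 5 consecutive values
        trunc((rownum - 1)/5)                           trunc_5,        -- 200,000 values, 5 rows per value
        sysdate - 200000/86400
                + trunc((rownum - 1)/5)/86400           five_per_sec,   -- 5 rows per second for 200,000 seconds
        trunc(sqrt(rownum - 1))                         sqrt_1000       -- 2N + 1 rows with the value N, N = 0 .. 999
from
        generator g1,
        generator g2
where
        rownum <= 1e6
;

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          => 'T1',
                method_opt       => 'for all columns size 1'
        );
end;
/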

I’ve generated 1 million rows – it took just a few seconds to do this. The mod_200 column holds 200 distinct values (5,000 rows per value) evenly spread throughout the table. The mod_200_fudge column is closely correlated with mod_200, but I’ve introduced a small random adjustment so that for each value of mod_200 the mod_200_fudge column holds roughly 1,000 rows for each of 5 consecutive values starting from the mod_200 value.

The trunc_5 column holds 200,000 distinct values, clustered in groups of 5 rows with the same value, starting at 0 and ending with 199,999. The same five-rows-per-value pattern is repeated in the date-based five_per_sec column, which starts 200,000 seconds in the past and holds 5 rows per second for 200,000 seconds.

The final column is for some very skewed data, generated using the square root function. Thanks to the trunc() call it holds 1,000 distinct values (from 0 to 999); there’s just one row with the value 0, 3 with the value 1, 5 with the value 2, and so on (2N + 1 rows with the value N), until we get to 1,999 rows with the value 999. Despite the uneven data distribution I’ve only gathered simple stats on this column: in this installment I’m not going to say anything about the effects of histograms.

Demonstrations

I’m going to run a number of queries, pulling the execution plan from memory after each one. I’ve used the following framework, which I’ll show only for the first example:
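As a sketch (assuming the table is called t1) the framework for the first test looks like this:

set serveroutput off            -- so that display_cursor() reports the test query, not the SQL*Plus dbms_output call

select  count(*)
from    t1
where   sign(mod_200) = 1
;

select * from table(dbms_xplan.display_cursor);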

There’s a whole list of queries to examine over this article and the next: for each one, in its own section below, I’ll report the basic execution plan and give a brief description of what the query is trying to demonstrate.

function(column) = constant

The optimizer doesn’t know what the effect of a function will be. We can look at the function and the input data and, in this case, we know that most of the data will satisfy the predicate, since sign(mod_200) will be 0 only when mod_200 is zero and 1 for the rest of the data. The query returns the result 995,000, but the plan predicts 10,000: a fixed guess of 1%.

Fortunately we can work around this type of problem with a virtual column if we decided that the estimate is bad enough to be a threat:
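A sketch of that workaround (I’ve assumed the name sign_mod_200 for the virtual column; the reason for the “size 3” is explained below):

alter table t1 add (
        sign_mod_200 generated always as (sign(mod_200)) virtual
);

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          => 'T1',
                method_opt       => 'for columns sign_mod_200 size 3'
        );
end;
/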

It’s important to note that when you create a virtual column you need to collect statistics on that column before it will have the effect you expect – though you may still get lucky if you forget. You can add virtual columns until you hit the standard limit of 1,000 columns in total for the table – but don’t be too hasty to add virtual columns to solve every problem, remember that Oracle will be collecting stats on all those columns during its automatic stats collection job. (And if you’re using partitioned tables and doing everything on full automatic with incremental statistics then you’ll be creating and storing synopses as well).

I’ve gathered stats with “for columns XXX size 3” in this case because the sign() function can only return 3 possible values (-1, 0, 1) and since I’m expecting a massive data skew it may help me to have a frequency histogram on the column.

With the virtual column in place – and no change to the code – the optimizer’s estimate for the number of rows returned jumped to the correct value.

With a slightly different requirement – what if I’m interested only in the small number of rows where the sign() is zero – I might choose to create a different virtual column and index it:
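A sketch, with assumed names, of the kind of thing I mean: the expression is NULL for every row we don’t care about, so those rows generate no index entries:

alter table t1 add (
        mod_200_sign_zero generated always as (
                case sign(mod_200) when 0 then 0 end    -- NULL for every row where sign(mod_200) != 0
        ) virtual
);

create index t1_sign_zero on t1(mod_200_sign_zero);

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          => 'T1',
                method_opt       => 'for columns mod_200_sign_zero size 1'
        );
end;
/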

This gives us an index that is as small as possible: it contains entries only for the rows that we are interested in. That demonstrates a useful principle, even though in this particular case it’s neither sensible nor useful; there’s too much relevant data for the index to be effective, and even if the target data set were much smaller we wouldn’t index the sign() function anyway, we would simply create a function-based index to identify the rows where mod_200 itself was zero.

function(column) > constant

In this case I’ve copied the current date into a substitution variable before using it. As before the optimizer has no idea of the effect of the function on the column.

Because I created the data by working backwards from sysdate this query will find no matching rows – but the optimizer guesses otherwise:
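Reconstructed as a sketch (the SQL*Plus substitution variable name m_now is my assumption):

column now_time new_value m_now

select  to_char(sysdate,'dd-mon-yyyy hh24:mi:ss') now_time
from    dual;

select  count(*)
from    t1
where   trunc(five_per_sec) > to_date('&m_now','dd-mon-yyyy hh24:mi:ss')
;
-- actual rows: 0; optimizer estimate: 5% of 1,000,000 = 50,000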

The guess is 5%. The same would be true for variations on this range-based predicate, while a BETWEEN predicate would report a selectivity of 0.25% (which, not coincidentally, happens to be 5% of 5%). The same 5% (and, mutatis mutandis, 0.25%) appears with predicates of the form “column > {unknown value}”.

Again we can add a virtual column and gather stats – though in an example of this type, but with a much larger date range, a histogram isn’t likely to be appropriate:
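A sketch of that virtual column (the name trunc_five_per_sec comes from the discussion below; the simple “size 1” stats call is my choice, in line with the comment about histograms):

alter table t1 add (
        trunc_five_per_sec generated always as (trunc(five_per_sec)) virtual
);

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          => 'T1',
                method_opt       => 'for columns trunc_five_per_sec size 1'
        );
end;
/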

Again, with no change to the query text, the predicted number of rows changes. Unfortunately this specific example now demonstrates a different problem.

The optimizer now thinks we will get 333,000 rows when we know we will get no data. The problem (that we will see again in the next installment) is that our predicate is outside the known range (low/high) values of the virtual column stats, so the optimizer has used 1/num_distinct as the selectivity – which is another unfortunate guess about what it should do when the predicates are not “nice”. Since I had only 3 distinct values for trunc_five_per_sec the predicted row count is 1,000,000/3 which is the 333,333 that you see.

Comparing columns

Although it may seem a little unrealistic to examine a query where we compare two columns in the same table, the principle I’m about to demonstrate applies to comparing columns across tables in join queries.

We know that for any one value of mod_200 there will be roughly 1,000 rows with the same value for mod_200_fudge and 4,000 spread fairly evenly across the next 4 consecutive values. Since there are 200 distinct values involved we expect the resulting count to be roughly 200,000 – but here’s the optimizer’s estimate:
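Sketched out (same assumed table t1):

select  count(*)
from    t1
where   mod_200 = mod_200_fudge
;
-- actual rows: roughly 200,000; optimizer estimate: 1,000,000/204, i.e. about 4,902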

The rationale here is that the optimizer knows that there are 200 distinct values for mod_200 and 204 distinct values for mod_200_fudge, so it has used the larger number of distinct values (lower value for selectivity) to calculate the expected cardinality. In effect it has considered the two predicates “mod_200 = {constant}” and “mod_200_fudge = {constant}” and used the arithmetic for the predicate that will return the smaller number of rows: 1,000,000 / 204 = 4901.96.

In this case you might have to depend on dynamic sampling to get a better estimate – except virtual columns can still give you a way around this (specific) problem. If mod_200 = mod_200_fudge then mod_200_fudge – mod_200 = 0: so let’s create that virtual column, gather stats, and modify the query. Note that in this case I expect the virtual column to hold 5 distinct values of about 200,000 rows each – in other cases where the data follows a more interesting pattern you might want to create a histogram, or even introduce a more sophisticated virtual column (and matching code change) in order to take advantage of some precision indexing.
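Here’s a sketch of that approach (the name mod_200_diff reappears later in the article so I’ve used it here; the simple stats call is my assumption):

alter table t1 add (
        mod_200_diff generated always as (mod_200_fudge - mod_200) virtual
);

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          => 'T1',
                method_opt       => 'for columns mod_200_diff size 1'
        );
end;
/

select  count(*)
from    t1
where   mod_200_diff = 0
;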

With the column (and stats) and change in code the optimizer has correctly predicted 200,000 rows for the query. (I’ve included the query and predicate section of the dbms_xplan output here to show that while I used the virtual column for the query Oracle has managed to display the underlying columns as the meaning of the predicate).

It’s worth noting that adding virtual columns can expose badly written applications to a performance threat, or even a risk of crashing. First, if your application contains code that uses the “select *” notation then it will select the virtual columns: this may simply increase the cost of moving data across the network, but if the code isn’t designed to deal with an open-ended list of columns you may find that it crashes because there aren’t enough receiving variables for the select list. Second, if your application has code that does “insert into tableX values()” without specifying a list of table columns then it will crash with Oracle error “ORA-00947: not enough values”. We can bypass both problems in 12c because we can declare the virtual columns to be invisible.

Combining Columns

This is a very well-known problem. Oracle assumes that there is no correlation between the different columns in a table, and derives the selectivity for ANDs and ORs of simple predicates using the rules of probability theory for independent variables which I’ll describe after showing you some results:
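In sketch form, the test query and the figures involved:

select  count(*)
from    t1
where   mod_200       = 100
and     mod_200_fudge = 100
;
-- actual rows: roughly 1,000; optimizer estimate: 25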

We know that of the 5,000 rows where mod_200 = 100 roughly 1,000 rows will have mod_200_fudge = 100; so the optimizer’s estimate is far too low. This is because the optimizer doesn’t have any information to tell it that the two columns are closely related. Its arithmetic has said:

1 in 200 rows will have mod_200 = 100, and 1 in 200 rows will have mod_200_fudge = 100, so one row in 200 * 200 will report both predicates true, which gives 1,000,000/40,000 = 25 rows.

Oracle allows us to create column groups to address this type of problem – they are only effective when all the predicates involved use equality, and they cease to be effective if the individual columns have histograms (but see Chris Antognini’s note, especially the addendum) or if the predicates go outside the low/high range for any of the columns.
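A column group for this example could be created, and its stats gathered, with a call along these lines (the “size 1” is my choice):

begin
        dbms_stats.gather_table_stats(
                ownname          => user,
                tabname          => 'T1',
                method_opt       => 'for columns (mod_200, mod_200_fudge) size 1'
        );
end;
/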

A quick check of view user_tab_cols shows us that we have a new (hidden, system-generated) column called SYS_STUHEAW8XLV$A#9LXJ65QQJU13 holding 1,000 distinct values. Internally Oracle has created a virtual column using a hash function called sys_op_combined_hash(), and then calculated the selectivity for the predicate: “sys_op_combined_hash(mod_200, mod_200_fudge) = sys_op_combined_hash(100,100)”.

You can create a maximum of 20 “column groups” like this – so don’t waste the limited scope the option gives you. In this particular case there is an alternative that would allow us to avoid “using up” one of the limited supply. We’ve already created a virtual column mod_200_diff based on (mod_200_fudge - mod_200); look what happens if we rewrite our query:
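In sketch form:

select  count(*)
from    t1
where   mod_200      = 100
and     mod_200_diff = 0        -- logically equivalent to mod_200_fudge = 100 when mod_200 = 100
;
-- optimizer estimate: 1,000,000 * (1/200) * (1/5) = 1,000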

The query is logically identical to the original, and we’ve got the correct cardinality estimate by using a virtual column rather than a column group. The arithmetic has worked as follows:

  • Selectivity of (mod_200 = 100) is 1/200
  • Selectivity of (mod_200_diff) = 0 is 1/5
  • Selectivity of (mod_200 = 100 and mod_200_diff = 0) is 1/200 * 1/5 = 1/1000.

Because of the carefully constructed patterns I created in the data, these numbers happen to be a valid representation of the data. You’re not likely to find such a perfect model in real data, but you may spot occasions where something like this is a good enough approximation to allow you to give the optimizer a much better idea of the cardinality than it would otherwise have.

A word of warning when upgrading to 12c: if you leave all the automatic optimization features enabled, Oracle is capable of creating, and gathering stats on, column groups without giving you any indication that it has done so. This is a side effect of the cardinality (statistics) feedback, dynamic statistics, and SQL plan directive mechanisms. Since you can only have 20 sets of extended stats per table you may need to keep a close eye on the views user_tab_cols and user_stat_extensions to make sure that you know what column groups have been created on your tables.
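For example, a quick check like the following (against our demo table) lists any extensions on the table; the creator column should distinguish the ones you created from the ones Oracle created for itself:

select  extension_name, extension, creator
from    user_stat_extensions
where   table_name = 'T1'
order by
        extension_name
;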

Conclusion

This article has already gone on too long and there are still several cases I haven’t yet covered where the optimizer has problems getting reasonable cardinality estimates. The cases we have covered, though, can often be handled very nicely by creating virtual columns or the “column group” variation of extended statistics.

In the next article I’ll carry on working through the list of problems introduced at the start of this article, where we’ll find that there are some problems that are much harder to deal with.

 

–> Catalogue of current articles in CBO series.