Massive Deletes – Part 3

In the previous installment of this series we looked at methods for seeing the pattern of the changes that would appear in a table and its indexes after a large delete. In this installment we move on to the workload implied by different patterns, then think about strategies or, indeed, the need for strategies for reducing that workload at the time when it matters most.

Basic Overheads

Apart from time itself, there are three measures that we are most likely to be interested in as our massive delete progresses: the volume of undo generated, the volume of redo generated, and the volume of I/O that takes place. Unfortunately these measures aren’t independent of each other – and that introduces a lot of complexity into a general discussion of optimizing a delete – and the volumes can have an impact on other sessions that are running concurrently with the delete while, conversely, the effects of other sessions can produce feedback that affects the performance of the delete.

Redo (part 1): every change to a data block generates redo – a description of how you’ve modified the block plus an overhead of a few tens of bytes: and if you’ve got four indexes on the table then you may be changing 5 blocks (one table block, four index leaf blocks) for every row you delete.

Undo: Every time you delete a row you “preserve” the row by writing into the undo segment a description of how to re-insert that row, and how to “undelete” the index entries. Each undo entry has an overhead of a few tens of bytes and each undo entry is a change to a data block so generates more redo (part 2). Combining the two sources of redo, a single row delete with four affected indexes could produce around 1KB of redo, plus the size of the row and index entries deleted.
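
As a rough back-of-the-envelope check on that figure (the per-entry overheads are only approximate, so treat this as an illustration rather than a formula):

    1 row deleted     =>  5 block changes (1 table block + 4 index leaf blocks)
    5 block changes   =>  5 undo records, i.e. 5 further block changes (to undo blocks)
    ~10 change vectors x a few tens of bytes of overhead each  =>  several hundred bytes to ~1KB of redo

and that is before counting the deleted row and index entries themselves, which also have to be written into the undo and redo.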

I/O effects: We may have to read a data block into the buffer cache to delete a row, and we may have to read into the buffer cache each index leaf block affected by the deletion of that row. At some stage we will have to write out the changed data (and index) blocks – and that includes the undo blocks we have been changing. We will have to wait for the reads, of course, but fortunately the writes might be handled in the background and might not affect the rate at which we can delete. On the other hand, for a very large delete we might find that the volume of undo we generate is so large, and the reads of the table and indexes so random, that we overwhelm the buffer cache and constantly force the database writer to write dirty blocks to make buffers free for the reads our session wants to do. Although, technically, those writes are background writes, our session may end up suffering wait times (on the "write complete waits" event) that are almost as bad as waiting for local (foreground) writes.
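
If you want to check whether this is happening to a particular delete, a minimal sketch (assuming you are checking your own session and have access to the v$ views) is to look at the relevant wait events before and after the run:

select  event, total_waits, time_waited_micro
from    v$session_event
where   sid = sys_context('userenv', 'sid')
and     event in (
            'db file sequential read',
            'free buffer waits',
            'write complete waits'
        )
order by time_waited_micro desc
;

A large figure for "free buffer waits" or "write complete waits" is a good clue that the delete is fighting the database writer for the buffer cache.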

Concurrency (1): the effects of undo and redo don’t stop at our session, though – even if the only other activity on the system is users running “read-only” reports. If other sessions have to examine blocks that we have modified (and not yet committed) then they will probably have to start using undo blocks to check commit times and produce read-consistent versions of the blocks we’ve changed so that they can still see the rows we are deleting. This has two effects – other sessions may end up physically reading undo blocks (increasing the I/O load), and they may then force the database writer to write dirty buffers to disc to make buffers free for the read-consistent copies they want to create (further increasing the I/O load). Moreover, if another session has to read from disc a block that we have changed then one of the first things it will do is prepare to apply “delayed block cleanout” for the uncommitted change (i.e. our delete) that it finds there – and even though it will then discover that it doesn’t need to do the cleanout it will still have generated 60 bytes of redo; and every time such a block is read from disc the reading session will generate another 60 bytes, until we finally commit and the next session to read the block performs a “proper” delayed block cleanout. If you’ve ever noticed that a large-scale delete slows down the longer it runs, these concurrency effects are probably one of the key reasons.

Concurrency (2): If other sessions are modifying the data the concurrency problems can be much worse – even ignoring the possibility that another session may be locking a row we want to delete, leaving our session waiting on a TX lock. Even if no other session changes any of the rows we are trying to delete, other sessions may change rows that share blocks with the rows we want to delete, and we will have to create read-consistent copies of those blocks to check whether data that is currently there was there when we started our delete, and that data that was there when we started hasn’t since disappeared – we need to do a “write-consistent” delete. At best this means we could do a lot of work checking; at worst it means our session might find an inconsistency it can’t cope with, roll back, and restart the delete automatically. The latter may be a little rare, but the former is the second of the two key reasons why big deletes often end up getting slower and slower as they run.

Some interesting numbers

In the previous installment I supplied some code to create a table with four indexes to use as a basis for some thought experiments, and I’m now going to use that data to produce some performance figures. Before I do so, though, it’s worth mentioning how the optimizer calculates the cost of a delete. It’s very simple: the cost is the cost of the equivalent “select rowid from …” query to identify the rowids of the rows to be deleted. The optimizer doesn’t allow for the cost of visiting the table to delete the row, nor for the cost of maintaining the indexes.

In a simple case like “delete from t1 where client_ref < ‘F’” this means Oracle will choose one of three execution plans – a tablescan of t1, an index range scan of the index on client_ref, or an index fast full scan of the index on client_ref. In fact, until 12c, the optimizer would not choose the index fast full scan (though you could hint the path) – this was eventually considered to be a bug and fixed in 12.1.0.2.
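
As a sketch (not the original test code), the three paths can be requested with hints – I’ve used the “column description” form of the index hints so that the actual index name doesn’t matter:

delete /*+ full(t1) */                    from t1 where client_ref < 'F';
delete /*+ index(t1 (client_ref)) */      from t1 where client_ref < 'F';
delete /*+ index_ffs(t1 (client_ref)) */  from t1 where client_ref < 'F';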

With that in mind, let’s look at a few figures for two delete statements: the first will delete all the rows where the date_open is more than five years ago (i.e. the oldest 50% of the data), the second will delete clients with reference codes starting with the letters A to E (a little less than 20% of the data, scattered across the whole table).
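
In sketch form the two statements would look something like this – the exact predicates are my guesses at the spirit of the tests, the real definitions are in the previous installment’s code:

delete from t1 where date_open  < add_months(sysdate, -60);   -- oldest ~50% of the data
delete from t1 where client_ref < 'F';                        -- refs starting 'A' to 'E', ~20%, scattered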

I’m going to set up six different scenarios and report four key statistics. The first scenario deletes from the table when there are no indexes, the second with just the primary key index in place, the third with the primary key and client_ref indexes in place – in all three of these cases the delete will follow a full tablescan.

The last three scenarios will have all four indexes (primary key, date_open, date_closed and client_ref) in place; the first of these will use a tablescan, the second will use an index fast full scan on the date_open index – the path that appeared by default, in fact, in 12c – and the last will use an index range scan on the date_open index.

From the session activity (v$sesstat) I’ll report the number of redo entries, the redo size, and the undo change vector size; and I’ll finish with the actual execution time for each delete. I have to warn you that the execution time isn’t really a good indicator for the general case, since (a) I’m using solid state discs so any I/O will be very quick and (b) the run-time can be affected most significantly by the size of the buffer cache compared to the number of blocks accessed and modified.
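
For reference, a minimal sketch of the kind of query you can use to pick up those three figures for your own session – snapshot it before and after the delete and report the differences:

select  sn.name, ms.value
from    v$statname sn, v$mystat ms
where   ms.statistic# = sn.statistic#
and     sn.name in (
            'redo entries',
            'redo size',
            'undo change vector size'
        )
order by sn.name
;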

One other point that needs to be stated explicitly – after each test I drop (with purge) and recreate the table, so whatever happens on a test run is independent of previous tests.

 

                                    Redo entries    Redo size (GB)    Undo change vector size (GB)    Time (mm:ss)
No indexes                             5.312 M          2.092                  1.296                      0:37
PK index                              10.325 M          3.258                  1.736                      1:05
PK & client_ref                       15.344 M          4.383                  2.157                      1:31
4 indexes + Table scan                25.368 M          6.672                  3.036                      2:20
4 indexes + Index fast full scan      25.353 M          6.676                  3.038                      2:12
4 indexes + Index range scan           5.446 M          2.476                  1.619                      1:12

Looking at the top three results – tablescans with zero, one and two indexes – we can see roughly one redo entry for each row deleted from the table (about 5M entries), plus a further 5M entries (one per row) for each index we add. The redo size increases fairly linearly with the number of indexes, as do the undo change vector size and (approximately) the time to execute.

This pattern continues into the case where we have all 4 indexes in place and delete using the tablescan or the index fast full scan – though there is a tiny variation in resource usage and a small variation in time between the two.

We then get an extraordinary change when we drive the delete through the index on date_open: the number of redo entries drops dramatically – almost back to the “no index” figure – as do the redo and undo sizes in GB, and the time to delete roughly halves. What has changed?

It’s not a well-known phenomenon, but Oracle uses a completely different strategy for deleting through an index range scan or index full scan from the strategy it uses when deleting through a tablescan or index fast full scan. For the tablescan / index fast full scan Oracle deletes a row from the table then updates each index in turn before moving on to delete the next row. For the index range scan / index full scan Oracle deletes a table row but simply records the rowid and the key values for each of the indexes that will need to be updated – then carries on to the next row without touching the indexes. When the table delete is complete Oracle sorts all the data it has accumulated by index and, within each index, by key, and uses bulk updates to do delayed maintenance on the indexes. Since bulk updates result in one undo record and one redo change vector per block updated (rather than one per row updated) the number of redo entries can drop dramatically, with a corresponding drop in the redo and undo sizes. In our example we get a massive saving because we are updating in the order of 22,000 blocks per index (22,000 undo/redo entries) rather than 5 million rows per index.
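
As a rough, order-of-magnitude comparison based on the figures quoted above:

    row-by-row maintenance:   5,000,000 rows   x 4 indexes  =>  ~20,000,000 index undo/redo entries
    delayed maintenance:         22,000 blocks x 4 indexes  =>      ~88,000 index undo/redo entries

which is why the “redo entries” figure for the index range scan delete is so close to the figure for the delete with no indexes at all.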

But there’s an important pair of statistics – the sort figures – that appeared only for the delete by range scan, reporting roughly 20M rows sorted.

That 20M rows sorted is 5M rows x 4 indexes. With my memory sizing the sorting spilled to disc (as 4 separate sort operations, though I think I used to see the entire operation handled as a single combined sort in earlier versions of Oracle). So you have a choice of mechanism for doing large deletes, and that choice of how you use resources may have a significant impact on where the workload appears and how big that impact is. (Note: the same variation in strategy is available for an update.)
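
A minimal sketch for checking the sort activity in your own session (again, snapshot it before and after the delete; the statistic names are the standard v$statname entries):

select  sn.name, ms.value
from    v$statname sn, v$mystat ms
where   ms.statistic# = sn.statistic#
and     sn.name in ('sorts (rows)', 'sorts (memory)', 'sorts (disk)')
;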

The fact that there is a variation, and that my first example showed a huge benefit for the index-driven approach, doesn’t mean that driving through an index is automatically the best thing to do. You still have to think about the patterns in the data. Here’s my second example – with all four indexes in place and showing the three possible execution plans (tablescan, index range scan, index fast full scan).

In this case the delete is again for data that is indexed, but the data is evenly distributed across the table rather than being sequentially arranged at the start of the table. The number of rows to be deleted is about 1.9M:

 

                                     Redo entries    Redo size (GB)    Undo change vector size (GB)    Time (mm:ss)
Index range scan (client_ref)           2.168 M          0.967                  0.627                      2:32
Index fast full scan (client_ref)       9.779 M          2.571                  1.168                      8:28
Tablescan                               9.781 M          2.571                  1.168                      1:08

Again we see that driving the delete through the appropriate index results in the smallest number of redo entries and the minimum redo size; however the tablescan, despite generating significantly more undo and redo, executes in less than half the time of the range scan; and the index fast full scan, although its undo and redo figures are virtually identical to the tablescan’s, runs nearly 3.4 times more slowly than the index range scan.

To understand why the best (time-focused) strategy for this delete is the opposite of the best strategy for the previous delete we need to find where the time was spent, and that leads us to the data file I/O pattern and, if we want more detail, into the segment statistics (v$segstat / v$segment_statistics). The headline figures are the volumes of physical reads – roughly 1.275M read requests for the index range scan and about 2.86M for the index fast full scan – and an examination of where that I/O takes place (which segments) helps us understand what has happened.
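
A minimal sketch of the kind of query I mean – the index names used here (t1_pk, t1_dop, t1_dcl, t1_ref) are hypothetical, so substitute your own object names:

select  object_name, value
from    v$segment_statistics
where   owner = user
and     object_name in ('T1', 'T1_PK', 'T1_DOP', 'T1_DCL', 'T1_REF')
and     statistic_name = 'physical reads'
order by value desc
;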

When we drive through the index range scan we do a very efficient job of delayed maintenance of the 4 indexes – but the pattern of the data for the range of client_ref values means we keep jumping all over the table, constantly flushing out of memory then re-reading the blocks that we need to update. Of the 1.275M single block reads, 1.19M are blocks from the table – and there were only 204,000 blocks in the table; we’ve read each block five times (effectively once for each of the letters from ‘A’ to ‘E’).

When we do a fast full scan on the index (which, unfortunately, was the optimizer’s choice when not hinted) we have the same problem, plus we update the indexes one row at a time, jumping around the other three indexes in a fairly random manner – which means we end up with fewer buffers to keep the table cached and do more re-reads on the table, together with re-reads on the index blocks. The 2.86M reads for the fast full scan method are made up of 1.75M reads from the table, 274K reads from the primary key index and 430K each from the two date-based indexes, compared with just 25,000 reads from the client_ref index that is driving the index fast full scan.

When we do the tablescan we read every table block just once – no jumping around and re-reading there – and walk smoothly through two of the indexes (primary key and date_open) while jumping around a little more randomly on the date_closed index and focusing on a small section (5/26) of the client_ref index, which conveniently stays very well cached as a consequence. The numbers of random reads on the four indexes are 22K, 26K, 40K and 4.5K respectively.

What we are seeing is just another minor variation of the way we have to think about indexes and the old question: “why isn’t Oracle using my index?”. Whenever we look at an index we ask two questions: “how much data will we visit?”, “how scattered is that data?” The choice between using an index or using a tablescan to drive a delete is really just another variant on the choice the optimizer has to make when deciding how to select data from a table.

Choices

Ultimately we tend to have two choices – do we delete by index range scan or by tablescan? (For more complex deletes the choice is likely to disappear; the simple index range scan seems, with minor variations, to be the only special case.) If we do have the choice we need to think about each index in turn and ask the questions:

  • If I tablescan, how will I jump around this index, and how much benefit will I get from caching for this index? Is this index such a threat for random I/O that I should consider dropping (or disabling) it while doing the delete (see the sketch after this list)? If I drop this index (and, perhaps, a couple of the other indexes) will that allow the rest of the indexes to stay cached while I do the delete?
  • If I (can) use this index to drive the delete, how will I jump around the table, how much benefit will I get from the delayed index maintenance feature, and how much will I suffer from random I/O against the table and failure to cache the table blocks? Is the threat from random table I/O something I can ignore because of the reduction in redo generation, redo log archiving etc.? How much work will I do sorting the related index entries that I will eventually delete?
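
A minimal sketch of the “disable it for the duration” option – the index name t1_dcl is hypothetical, a unique or primary key index needs more care, and don’t forget that the rebuild is itself an expensive operation:

alter index t1_dcl unusable;
alter session set skip_unusable_indexes = true;   -- the default from 10g onwards

delete from t1 where client_ref < 'F';
commit;

alter index t1_dcl rebuild;                        -- possibly online, parallel, nologging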

As we see so often, it’s not always easy to make the best choice – but it’s nice to know that there is a choice to be made. As a very tentative guideline – if you can drive your delete through an index that has well clustered data behind it then that’s likely to be a case where the index path will be a better choice than the tablescan path. As with all guidelines, though, there are other traps to consider – and I’ll be looking for those in a future article.

Apart from anything else, though, I think there’s one particularly important and easily remembered detail you should take away from this article: when you upgrade to 12c the optimizer has a third choice of access path for a simple delete – driving through an index fast full scan – and it’s just possible that this path will be worse than anything that’s happening at present, but will be chosen by the optimizer because it’s the access path with the lowest cost for “select rowid from table…”.