Massive Deletes – Part 2

In Part 1 of this short series I provided an informal description of a couple of scenarios where we might do a large-scale delete from a table. Without a concrete example though, it can be hard to imagine how the nature of the data deleted and the access paths available could affect the impact a large delete operation could have on your system, so I’m going to spend most of this article talking about a couple of tests on a generated data set. The article may seem a bit long but quite a lot of the space will be taken up by tabular reports.

Sample Data

With the ever-increasing power and scale of hardware it becomes harder and harder to agree on what we might mean by a “large table” or “massive delete” – to one person 1 million rows might seem large, to another 100 million might seem fairly ordinary. I’m going to go with a compromise: 10 million rows, representing an investment system that has been growing by 1M rows per year for 10 years and has reached a segment size of 1.6GB. This table is, of course, just one of several making up the full system, and at some point there will be related data to worry about, but for the present we’re going to isolate this table and consider only the table itself and the 4 indexes on it. Here’s the code to generate the data set:
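In outline it looks like this – treat it as a sketch: the table name t1, the padding column and the exact random/arithmetic expressions are illustrative choices, picked simply to match the description in the next paragraph:

rem
rem     Sketch only - t1 and the padding column are assumptions;
rem     the other columns follow the description below.
rem

create table t1 (
        id              number(10)      not null,
        date_open       date            not null,
        date_closed     date            not null,
        deal_type       varchar2(1)     not null,
        client_ref      varchar2(4)     not null,
        padding         varchar2(100)
)
nologging
;

insert /*+ append */ into t1
with generator as (
        select  /*+ materialize */
                rownum  id
        from    dual
        connect by
                level <= 1e4                            -- 10,000 rows
)
select
        1e4 * (g1.id - 1) + g2.id                                       id,
        trunc(sysdate) - 3652 +
                trunc((1e4 * (g1.id - 1) + g2.id) * 3652 / 1e7)         date_open,
        add_months(
                trunc(sysdate) - 3652 +
                        trunc((1e4 * (g1.id - 1) + g2.id) * 3652 / 1e7),
                12 * trunc(dbms_random.value(1,6))                      -- add 1 to 5 whole years
        )                                                               date_closed,
        cast(dbms_random.string('U',1) as varchar2(1))                  deal_type,
        cast(dbms_random.string('U',4) as varchar2(4))                  client_ref,
        rpad('x',100,'x')                                               padding
from
        generator       g1,
        generator       g2
where
        g1.id <= 1e3            -- 1,000 x 10,000 = 10M rows; no rownum in this select
;

commit;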

It’s not immediately obvious, but the code generates 10 million rows; the date_open starts 120 months (10 years, 3,652 days) in the past, and the arithmetic used to increment the value means the most recent entry is at the current date. The date_closed is an integer number of years between 1 and 5 (inclusive) added to the date_open (the table is a simple-minded model of recording fixed-term investments). The deal_type is a randomly-generated single uppercase character – producing 26 distinct values with roughly equal volumes of data; the client_ref is randomly generated as a fixed-length string of 4 uppercase characters, giving about half a million combinations and roughly 20 rows per combination.

As a side note – I’ve generated the data set without using rownum anywhere in the high-volume select; this would make it possible for me to use parallel execution to generate the data more quickly (both the “level” and the “rownum” pseudo columns limit Oracle’s ability to use parallel execution). In this case though, because I want the id column to model a sequentially generated value being stored in order of arrival, I’m running the code serially.

Scale

On my laptop, running 12.1.0.2 on a Linux 5 VM, I got the following times for creating the data, gathering stats and creating the indexes:

This, of course, is where we have to start asking questions about realism and how different systems might behave. The virtual machine had 4GB of memory allocated (of which 1.6GB was set aside for the memory_target), and a nominal 2 CPUs running at 2.8GHz from a quad-core CPU – but possibly the most significant detail is that the machine has 1TB of solid state disc, so it doesn’t lose much time on physical I/O. The database was configured with 3 redo log groups of 200MB each (to encourage some delays for log file checkpoint and log file switch waits); the logs were duplexed but the instance wasn’t running in archivelog mode. After stats collection the table block count was roughly 204,000 blocks with 49 rows per block in most blocks, the PK and client indexes had about 22,000 leaf blocks each, and the two date indexes about 26,500 leaf blocks each.
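For reference, those four indexes could have been created (and the table stats gathered) with something like the following sketch – the index names here are assumptions, not the original definitions:

begin
        dbms_stats.gather_table_stats(
                ownname    => user,
                tabname    => 'T1',
                method_opt => 'for all columns size 1'
        );
end;
/

create unique index t1_pk on t1(id) nologging;
alter table t1 add constraint t1_pk primary key(id);

create index t1_client   on t1(client_ref)   nologging;
create index t1_dt_open  on t1(date_open)    nologging;
create index t1_dt_close on t1(date_closed)  nologging;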

Quality

It’s important when trying to use models like this to question how close to reality they get: so how many flaws can you spot in what I’ve done so far? First of all, the id column is too perfect – ids appear in perfect sequence in the table, while in real life, with concurrent single-row inserts, there would be a little ‘jitter’, with ranges of consecutive values being scattered over a small number of blocks; this is probably not terribly important. More significantly, I’ve created the indexes after inserting all the data, which means the indexes are physically as perfect as they could be (and with 10% free space per leaf block). I really ought to have created the table and indexes empty and then run several concurrent scripts doing single-row inserts, using a sequence to generate the ids – but the last time I did that sort of thing the run time jumped by a factor of 40 (which would be around 5 hours per run, and I want to drop and recreate this data several times). Again that’s not terribly significant, though I ought to remember that on a live system the average available space in index leaf blocks might be closer to 30% at any moment, with significant variation from block to block, and that on a production system I might want to check the state of the leaf blocks of the date-based indexes from time to time – the date_open index in particular.
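For the record, one of those concurrent scripts might look something like this – a sketch only, where the sequence name, loop count and commit frequency are all arbitrary choices (e.g. 10 sessions each inserting 1M rows):

create sequence t1_seq cache 1000;

declare
        m_id    number;
begin
        for i in 1 .. 1000000 loop
                m_id := t1_seq.nextval;
                insert into t1 (id, date_open, date_closed, deal_type, client_ref, padding)
                values (
                        m_id,
                        trunc(sysdate) - 3652 + trunc(m_id * 3652 / 1e7),
                        add_months(
                                trunc(sysdate) - 3652 + trunc(m_id * 3652 / 1e7),
                                12 * trunc(dbms_random.value(1,6))
                        ),
                        dbms_random.string('U',1),
                        dbms_random.string('U',4),
                        rpad('x',100,'x')
                );
                if mod(i,100) = 0 then
                        commit;         -- commit in small batches to mimic OLTP behaviour
                end if;
        end loop;
        commit;
end;
/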

Scenarios

Despite the fact that any timing is highly dependent on machine configuration and resources, and that the model is over-simplified, we can still get some interesting information from a few basic tests. Let’s start with a few scenarios relating to business-based decisions:

  • (a) Delete all deals that closed more than 5 years ago
  • (b) Delete all the deals where the client_ref starts with ‘A’ – ‘E’
  • (c) Delete all deals that opened more than 5 years ago

We could imagine that option (a) is a basic archiving requirement – maybe the data has been copied to another table before deletion. Option (b) perhaps tells us that the client_ref has been (ab)used to encode some significant classification in the first letter of the reference and we’re splitting the data into two processing sets. Option (c) could be part of a process aimed at partitioning the data by date_open (though I’m not sure it looks like a good way to go about partitioning in this case). Before doing anything that’s likely to be expensive to an Oracle database it’s always a good idea to see if you can visualize what Oracle will have to do, what the execution steps will be, and where the workload will appear. Are these scenarios all the same and, if not, how do they differ? If you don’t know your data and the impact of your delete you can always ask the database – for example:
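A query of the following shape does the job – a sketch, with all six predicates left commented out so that you can enable one at a time:

select
        rows_starting_in_block,
        count(*)                                                blocks,
        sum(count(*)) over (order by rows_starting_in_block)    blocks_running_total
from    (
        select
                dbms_rowid.rowid_relative_fno(rowid)    rel_file_no,
                dbms_rowid.rowid_block_number(rowid)    block_no,
                count(*)                                rows_starting_in_block
        from
                t1
        where   1 = 1
--      and     date_closed <  add_months(sysdate, -60)        -- (a) rows deleted
--      and     date_closed >= add_months(sysdate, -60)        -- (a) rows surviving
--      and     client_ref  <  'F'                             -- (b) rows deleted
--      and     client_ref  >= 'F'                             -- (b) rows surviving
--      and     date_open   <  add_months(sysdate, -60)        -- (c) rows deleted
--      and     date_open   >= add_months(sysdate, -60)        -- (c) rows surviving
        group by
                dbms_rowid.rowid_relative_fno(rowid),
                dbms_rowid.rowid_block_number(rowid)
        )
group by
        rows_starting_in_block
order by
        rows_starting_in_block
;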

You’ll notice that I’ve got six commented predicates (in three complementary pairs) in this query. The query is basically designed to give me a summary of how many blocks hold how many rows, but each pair of predicates gives me some idea of the effect of one of my scenarios – one predicate from each pair tells me something about the volume and pattern of data I’ll be deleting, the other tells me about the volume and pattern of data that would survive the delete. The former could give you some idea of how many different blocks you will have to modify and how much work that will take; the latter would help you understand how the system might behave afterwards. With a little bit of SQL*Plus formatting here’s the output after creating the data:

And here’s the output showing what the surviving data would look like if we delete all rows that opened more than 5 years ago (i.e. use the predicate date_open >= add_months(sysdate, -60)).

That’s rather nice – roughly speaking we’ve emptied out half the blocks in the table and left the other half untouched. If we tried a “shrink space” now we would simply be copying the rows from the 2nd half of the table to the first half – we’d generate a huge volume of undo and redo, but the clustering factor (or, specifically, the avg_data_blocks_per_key representation of the clustering_factor) of any indexes would probably be pretty much unchanged. Alternatively if we decide to leave the empty space as it is any new data would simply start filling the empty space very efficiently (almost as if it were using newly allocated extents) from the start of the table – again we’d see the clustering factor (i.e. avg_data_blocks_per_key) of any indexes pretty much unchanged. Compare this with the consequences of deleting all the rows that closed more than 5 years ago (i.e. what do we see left if we use the predicate date_closed >= add_months(sysdate, -60)) – the report is rather longer:

In this case we have roughly 60% of the blocks still holding the original 49 rows, but the rest of the blocks in the table have suffered anything from virtually no deletions to complete emptying (if you check the total number of blocks reported against the total in the first report you’ll notice that there must be a few blocks (24) which are now completely empty). How many of those blocks are now available for insert? Here’s a quick calculation: we had 49 rows in most of our blocks, which would have been filled to 90% (default pctfree = 10), so a block will drop to the 75% mark (which is when ASSM will flag it as having free space) when it has fewer than 41 rows in it (49 * 75 / 90): of our 204,000 blocks roughly 75,000 match that criterion (checking the “Blocks Running Total” column). In practical terms this means that if we shrink the table the avg_data_blocks_per_key will increase significantly, and if we leave the table as is and allow new data to be inserted into the free space then the avg_data_blocks_per_key will also increase significantly.
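If you’d rather ask ASSM directly than do that arithmetic, dbms_space.space_usage will report how many blocks fall into each quarter-full free-space category; here’s a minimal sketch, assuming the table is called t1 and sits in the current schema:

set serveroutput on
declare
        m_unformatted_blocks    number;         m_unformatted_bytes     number;
        m_fs1_blocks            number;         m_fs1_bytes             number;
        m_fs2_blocks            number;         m_fs2_bytes             number;
        m_fs3_blocks            number;         m_fs3_bytes             number;
        m_fs4_blocks            number;         m_fs4_bytes             number;
        m_full_blocks           number;         m_full_bytes            number;
begin
        dbms_space.space_usage(
                segment_owner           => user,
                segment_name            => 'T1',
                segment_type            => 'TABLE',
                unformatted_blocks      => m_unformatted_blocks,
                unformatted_bytes       => m_unformatted_bytes,
                fs1_blocks              => m_fs1_blocks,
                fs1_bytes               => m_fs1_bytes,
                fs2_blocks              => m_fs2_blocks,
                fs2_bytes               => m_fs2_bytes,
                fs3_blocks              => m_fs3_blocks,
                fs3_bytes               => m_fs3_bytes,
                fs4_blocks              => m_fs4_blocks,
                fs4_bytes               => m_fs4_bytes,
                full_blocks             => m_full_blocks,
                full_bytes              => m_full_bytes
        );

        -- FS1 = 0-25% free, FS2 = 25-50% free, FS3 = 50-75% free, FS4 = 75-100% free;
        -- blocks with more than 25% free (FS2 and above) match the 75% threshold above
        dbms_output.put_line('Full blocks        : ' || m_full_blocks);
        dbms_output.put_line('FS1 (0-25% free)   : ' || m_fs1_blocks);
        dbms_output.put_line('FS2 (25-50% free)  : ' || m_fs2_blocks);
        dbms_output.put_line('FS3 (50-75% free)  : ' || m_fs3_blocks);
        dbms_output.put_line('FS4 (75-100% free) : ' || m_fs4_blocks);
end;
/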

Index Space

The previous section showed some simple SQL that gave you an idea of how space would appear (or data would remain) in the table – can we do something similar for indexes? Inevitably the answer is yes, but the code that answers the question “what would this index look like after I deleted the data matching predicate X” is more expensive to run than the equivalent for the table. To start with, here’s a simple piece of code to check just the current content of an index:
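For the client_ref index it might look like the following sketch – the index name t1_client is an assumption, and the literal 99999 stands for the object_id of the index, which you would look up in user_objects:

select
        index_entries,
        count(*)        leaf_blocks
from    (
        select
                index_leaf_block,
                count(*)        index_entries
        from    (
                select
                        /*+
                                index_ffs(t1 t1_client)
                        */
                        sys_op_lbid(99999, 'L', t1.rowid)       index_leaf_block
                from
                        t1
                where
                        client_ref is not null
                )
        group by
                index_leaf_block
        )
group by
        index_entries
order by
        index_entries
;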

The call to sys_op_lbid() takes a table rowid as one of its inputs and returns something that looks like the rowid for the first row of a block, and that block’s address is the address of the index leaf block that holds the index entry for the supplied table rowid. The other two parameters are the object_id of the index (or partition/subpartition if the index is partitioned), and a flag which identifies the specific use of the function – an ‘L’ in our case. The hint to use an index fast full scan on the target index is necessary – any other path can return the wrong result – and the “client_ref is not null” is necessary to ensure that the query can actually use the index_ffs path legally.

For my initial data set the index had 448 index entries in every block except one (presumably the last, with 192 entries). As you can see, even this simple query has to be crafted to meet the requirements of each index – and because an index fast full scan is needed to get the correct result we have to do something even more unusual to see how our massive delete would affect the index. Here’s a sample to show how we would find out what effect an attempt to delete the rows opened more than five years ago would have on the client_ref index:
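Something like the following sketch (again t1_client and the object_id 99999 are placeholders): the inner view generates a (leaf block, rowid) pair for every index entry, and the join keeps only the rowids that the delete would remove:

select
        deleted_entries,
        count(*)                                        leaf_blocks,
        sum(count(*)) over (order by deleted_entries)   leaf_blocks_running_total
from    (
        select
                /*+
                        leading(v1 t1)
                        use_hash(t1)
                */
                v1.index_leaf_block,
                count(*)        deleted_entries
        from
                (
                select
                        /*+
                                no_merge
                                index_ffs(t1 t1_client)
                        */
                        sys_op_lbid(99999, 'L', t1.rowid)       index_leaf_block,
                        t1.rowid                                rid
                from
                        t1
                where
                        client_ref is not null
                )       v1,
                t1
        where
                t1.rowid = v1.rid
        and     t1.date_open < add_months(sysdate, -60)        -- rows the delete would remove
        group by
                v1.index_leaf_block
        )
group by
        deleted_entries
order by
        deleted_entries
;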

As you can see we start off with an inline (hinted non-mergeable) view to attach the index leaf block id to every single table rowid, then we join that set of rowids back to the table – joining by rowid and forcing a hash join. I’ve hinted the hash join because it’s (probably) the most efficient strategy but although I’ve put in a leading() hint I haven’t included a hint about swapping (or not) the join inputs – I’m going to let the optimizer decide which of the two data sets is the smaller and therefore more appropriate for building the hash table.

In this particular case the optimizer was able to use an index-only access path to find all the rowids for the rows where date_open was earlier than 5 years ago; even so (partly because my pga_aggregate_target was relatively small and the hash join spilled to (solid state) disc) the query took 3 minutes and 15 seconds to complete, compared to 1.5 seconds for the previous query – which I happened to run while the entire index was cached. Here’s an extract of the output:

We’re going to modify 22,322 leaf blocks – that’s every single leaf block in the index – and the number of rows we delete from a leaf block varies from just one to 266. I’ve selected a few rows at a time from the 83 lines of output, but you can probably still see that the pattern seems to follow the normal distribution, centered around the 222 (50%) mark. If we do that delete it should be clear that we’re going to expend a lot of effort updating this index; even then, the simple count of “how many rows deleted per leaf block” doesn’t tell us the whole story about the work we’ll do – we don’t know whether we will (for example) delete all 266 index entries at the same time from the last block reported above, or whether we’ll be jumping around the index almost randomly and find ourselves continuously revisiting that block to delete one index entry at a time (we could write some SQL to get some idea of that aspect of the problem, and it’s not particularly difficult to do, but I’ll leave that as an exercise for the reader). The difference in workload could be dramatic, and may be enough to make us drop (or mark unusable) and rebuild some indexes rather than maintaining them during the delete. So in the next installment we’ll look at which aspects of the workload we need to consider, and how different deletion strategies can have a significant impact on that workload.