In Part 1 of this short series I provided an informal description of a couple of scenarios where we might do a large-scale delete from a table. Without a concrete example, though, it can be hard to imagine how the nature of the data being deleted and the access paths available might affect the impact of a large delete on your system, so I’m going to spend most of this article working through a couple of tests on a generated data set. The article may seem a bit long, but quite a lot of the space is taken up by tabular reports.

Sample Data

With the ever-increasing power and scale of hardware it becomes harder and harder to agree on what we might mean by a “large table” or “massive delete” – to one person 1 million rows might seem large, to another 100 million might seem fairly ordinary. I’m going to compromise at 10 million rows, representing an investment system that has been growing by 1M rows per year for 10 years and has reached a segment size of 1.6GB. This table is, of course, just one of several making up the full system, and at some point there will be related data to worry about, but for the present we’re going to isolate just this table and consider only the table itself and the 4 indexes on it. Here’s the code to generate the data set:

execute dbms_random.seed(0)

create table t1 (
	id		not null,
	date_open, 	date_closed,
	deal_type,	client_ref,
	small_vc,	padding
)
nologging
as
with generator as (
	select	/*+ materialize cardinality(1e4) */
		rownum	id 
	from	dual
	connect by
		rownum <= 1e4
)
select
	1e4 * (g1.id - 1) + g2.id					id,
	trunc(
		add_months(sysdate, - 120) + 
			(1e4 * (g1.id - 1) + g2.id)* 3652 / 1e7
	)								date_open,
	trunc(
		add_months(
			add_months(sysdate, - 120) + 
				(1e4 * (g1.id - 1) + g2.id) * 3652 / 1e7,
			12 * trunc(dbms_random.value(1,6))
		)
	)								date_closed,
	cast(dbms_random.string('U',1) as varchar2(1))	deal_type,
	cast(dbms_random.string('U',4) as varchar2(4))	client_ref,
	lpad(1e4 * (g1.id - 1) + g2.id,10)		small_vc,
	rpad('x',100,'x')				padding
from
	generator	g1,
	generator	g2
where
	g1.id <= 1e3
and     g2.id <= 1e4
;

execute dbms_stats.gather_table_stats(user,'t1',method_opt=>'for all columns size 1')

alter table t1 add constraint t1_pk primary key(id) using index nologging;
create index t1_dt_open on t1(date_open) nologging;
create index t1_dt_closed on t1(date_closed) nologging;
create index t1_client on t1(client_ref) nologging;

It’s not immediately obvious, but the code generates 10 million rows; the date_open starts 120 months (10 years, 3,652 days) in the past and the arithmetic used to increment the value means the most recent entry is at the current date. The date_closed is an integer number of years between 1 and 5 (inclusive) added to the date_open (the table is a simple-minded model of recording fixed-term investments). The deal_type is a randomly generated single uppercase character, producing 26 distinct values with roughly equal volumes of data; the client_ref is randomly generated as a fixed-length string of 4 uppercase characters, giving about half a million possible combinations and roughly 20 rows per combination.
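
If you want to check that the generated data really does match that description, a couple of quick sanity checks (just a sketch; the distinct counts are not particularly cheap to run against 10 million rows) would look something like this:

select
        count(*)                        row_count,
        min(date_open)                  first_open,
        max(date_open)                  last_open,
        count(distinct deal_type)       deal_types,
        count(distinct client_ref)      client_refs
from
        t1
;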

As a side note – I’ve generated the data set without using rownum anywhere in the high-volume select; this would make it possible for me to use parallel execution to generate the data more quickly (both the “level” and the “rownum” pseudocolumns limit Oracle’s ability to use parallel execution). In this case, though, because I want the id column to model a sequentially generated value stored in order of arrival, I’m running the code serially.

Scale

On my laptop, running 12.1.0.2 on a Linux 5 VM, I got the following times for creating the data, gathering stats and creating the indexes:

Table creation:	7:06.40
Stats gathered:	0:10.54
PK added:	0:10.94
Index created:	0:10.79	(date_open)
Index created:	0:12.17	(date_closed)
Index created:	0:13.65	(client_ref)

This, of course, is where we have to start asking questions about realism and how different systems might behave. The virtual machine had 4GB of memory allocated (of which 1.6GB was set aside for the memory_target), and a nominal 2 CPUs running at 2.8GHz from a quad core CPU – but possibly most significant was the fact that the machine has 1TB of solid state disc, so doesn’t lose much time on physical I/O. The database was configured with 3 redo log groups of 200MB each (to encourage some delays for log file checkpoint and log file switch waits), the logs were duplexed but the instance wasn’t running in archivelog mode. After stats collection the table block count was roughly 204,000 blocks with 49 rows per block in most blocks, the PK and client indexes had about 22,000 leaf blocks and the two date indexes about 26,500 leaf blocks.
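
The table and index figures quoted above are just the optimizer statistics; once the stats are in place you can report them with a quick query against the data dictionary, something like the following sketch (which assumes the table and its indexes live in your own schema):

select
        table_name, blocks, num_rows
from
        user_tables
where
        table_name = 'T1'
;

select
        index_name, blevel, leaf_blocks, clustering_factor
from
        user_indexes
where
        table_name = 'T1'
order by
        index_name
;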

Quality

It’s important when trying to use models like this to question how close to reality they get: so how many flaws can you spot in what I’ve done so far? First of all, the id column is too perfect – ids appear in perfect sequence in the table, while in real life, with concurrent single-row inserts, there would be a little ‘jitter’, with ranges of consecutive values being scattered over a small number of blocks; this is probably not terribly important. More significantly, I’ve created the indexes after inserting all the data, which means the indexes are physically as perfect as they could be (and with 10% free space per leaf block). I really ought to have created the table and indexes empty and then run several concurrent scripts doing single-row inserts using a sequence to generate ids – but the last time I did that sort of thing the run time jumped by a factor of 40 (that’s around 5 hours per run, and I want to drop and recreate this data several times). Again, that’s not really terribly significant, though I ought to remember that on a live system the average available space in index leaf blocks might be closer to 30% at any moment, with significant variation from block to block, and on a production system I might want to check the state of the index leaf blocks of the date-based indexes from time to time – the date_open index in particular.
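
As an aside, if you do want to check how much free space has accumulated in the leaf blocks of (say) the date_open index on a live system, one possible approach is sketched below; be aware that a validate structure without the online option locks the table against concurrent DML while it runs, and the online option doesn’t populate index_stats:

analyze index t1_dt_open validate structure;

select
        lf_rows, lf_blks, del_lf_rows, pct_used,
        used_space, btree_space
from
        index_stats
;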

Scenarios

Despite the fact that any timing is highly dependent on machine configuration and resources, and that the model is over-simplified, we can still get some interesting information from a few basic tests. Let’s start with a few scenarios relating to business-based decisions:

  • (a) Delete all deals that closed more than 5 years ago
  • (b) Delete all the deals where the client_ref starts with ‘A’ – ‘E’
  • (c) Delete all deals that opened more than 5 years ago
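
In SQL terms these three options translate (roughly) into single delete statements like the following sketch; in real life the boundary conditions would need checking and each delete might well be preceded by an archiving step:

delete from t1 where date_closed < add_months(sysdate, -60);            -- option (a)
delete from t1 where substr(client_ref,1,1) < 'F';                      -- option (b)
delete from t1 where date_open < add_months(sysdate, -60);              -- option (c)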

We could imagine that option (a) is a basic archiving requirement – maybe the data has been copied to another table before deletion. Option (b) perhaps tells us that the client_ref has been (ab)used to encode some significant classification in the first letter of the reference and we’re splitting the data into two processing sets. Option (c) could be part of a process aimed at partitioning the data by date_open (though I’m not sure it looks like a good way to go about partitioning in this case). Before doing anything that’s likely to be expensive for an Oracle database it’s always a good idea to see if you can visualize what Oracle will have to do, what the execution steps will be, and where the workload will appear. Are these scenarios all the same and, if not, how do they differ? If you don’t know your data and the impact of your delete you can always ask the database – for example:

select
        rows_in_block,
        count(*)                                     blocks,
        rows_in_block * count(*)                     row_count,
        sum(count(*)) over (order by rows_in_block)                 running_blocks,
        sum(rows_in_block * count(*)) over (order by rows_in_block) running_rows
from
        (
        select 
                dbms_rowid.rowid_relative_fno(rowid), 
                dbms_rowid.rowid_block_number(rowid),
                count(*)                                rows_in_block
        from 
                t1
--
--      where   date_open >= add_months(sysdate, -60)
--      where   date_open <  add_months(sysdate, -60)
--
--      where   date_closed >= add_months(sysdate, -60)
--      where   date_closed <  add_months(sysdate, -60)
--
--      where   substr(client_ref,1,1)  >= 'F'
--      where   substr(client_ref,1,1)  < 'F'
--
        group by 
                dbms_rowid.rowid_relative_fno(rowid), 
                dbms_rowid.rowid_block_number(rowid) 
        )
group by
        rows_in_block
order by
        rows_in_block
;
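
For reference, the tabular reports produced by the query above were formatted with a little SQL*Plus column formatting and a computed total along the following lines (a sketch only; the exact formats and headings are a matter of taste):

column rows_in_block    format 99,999       heading 'Rows per block'
column blocks           format 999,999      heading 'Blocks'
column row_count        format 99,999,999   heading 'Rows'
column running_blocks   format 999,999      heading 'Blocks|Running total'
column running_rows     format 99,999,999   heading 'Rows|Running total'

break on report
compute sum of blocks on report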

You’ll notice that I’ve got six commented predicates (in three complementary pairs) in this query. The query is basically designed to give me a summary of how many blocks hold how many rows, but each pair of predicates gives me some idea of the effect of each of my scenarios – one predicate from each pair tells me something about the volume and pattern of data I’ll be deleting; the other tells me about the volume and pattern of data that would survive the delete. The former could give you some idea of how many different blocks you will have to modify and how much work that will take; the latter would help you understand how the system might behave afterwards. With a little bit of SQL*Plus formatting, here’s the output after creating the data:

                                              Blocks           Rows
Rows per block   Blocks         Rows   Running total   Running total
-------------- -------- ------------   -------------   -------------
            27        1           27               1              27
            49  203,877    9,989,973         203,878       9,990,000
            50      200       10,000         204,078      10,000,000
               --------
sum             204,078

And here’s the output showing what the surviving data would look like if we delete all rows that opened more than 5 years ago (i.e. use the predicate date_open >= add_months(sysdate, -60)).

                                              Blocks           Rows
Rows per block   Blocks           Rows Running total  Running total
-------------- -------- -------------- ------------- --------------
            27        1             27             1             27
            42        1             42             2             69
            49  102,014      4,998,686       102,016      4,998,755
               --------
sum             102,016

That’s rather nice – roughly speaking we’ve emptied out half the blocks in the table and left the other half untouched. If we tried a “shrink space” now we would simply be copying the rows from the second half of the table to the first half – we’d generate a huge volume of undo and redo, but the clustering factor (or, specifically, the avg_data_blocks_per_key representation of the clustering_factor) of any indexes would probably be pretty much unchanged. Alternatively, if we decide to leave the empty space as it is, any new data would simply start filling the empty space very efficiently (almost as if it were using newly allocated extents) from the start of the table – again we’d see the clustering factor (i.e. avg_data_blocks_per_key) of any indexes pretty much unchanged. Compare this with the consequences of deleting all the rows that closed more than 5 years ago (i.e. what would we see left if we used the predicate date_closed >= add_months(sysdate, -60)) – the report is rather longer:

                                              Blocks           Rows
Rows per block   Blocks           Rows Running total  Running total
-------------- -------- -------------- ------------- --------------
             1        5              5             5              5
             2       22             44            27             49
             3      113            339           140            388
             4      281          1,124           421          1,512
             5      680          3,400         1,101          4,912
             6    1,256          7,536         2,357         12,448
             7    1,856         12,992         4,213         25,440
             8    2,508         20,064         6,721         45,504
             9    2,875         25,875         9,596         71,379
            10    2,961         29,610        12,557        100,989
            11    2,621         28,831        15,178        129,820
            12    2,222         26,664        17,400        156,484
            13    1,812         23,556        19,212        180,040
            14    1,550         21,700        20,762        201,740
            15    1,543         23,145        22,305        224,885
            16    1,611         25,776        23,916        250,661
            17    1,976         33,592        25,892        284,253
            18    2,168         39,024        28,060        323,277
            19    2,416         45,904        30,476        369,181
            20    2,317         46,340        32,793        415,521
            21    2,310         48,510        35,103        464,031
            22    2,080         45,760        37,183        509,791
            23    1,833         42,159        39,016        551,950
            24    1,696         40,704        40,712        592,654
            25    1,769         44,225        42,481        636,879
            26    1,799         46,774        44,280        683,653
            27    2,138         57,726        46,418        741,379
            28    2,251         63,028        48,669        804,407
            29    2,448         70,992        51,117        875,399
            30    2,339         70,170        53,456        945,569
            31    2,286         70,866        55,742      1,016,435
            32    1,864         59,648        57,606      1,076,083
            33    1,704         56,232        59,310      1,132,315
            34    1,566         53,244        60,876      1,185,559
            35    1,556         54,460        62,432      1,240,019
            36    1,850         66,600        64,282      1,306,619
            37    2,131         78,847        66,413      1,385,466
            38    2,583         98,154        68,996      1,483,620
            39    2,966        115,674        71,962      1,599,294
            40    2,891        115,640        74,853      1,714,934
            41    2,441        100,081        77,294      1,815,015
            42    1,932         81,144        79,226      1,896,159
            43    1,300         55,900        80,526      1,952,059
            44      683         30,052        81,209      1,982,111
            45      291         13,095        81,500      1,995,206
            46      107          4,922        81,607      2,000,128
            47       32          1,504        81,639      2,001,632
            48        3            144        81,642      2,001,776
            49  122,412      5,998,188       204,054      7,999,964
               --------
sum             204,054

In this case we have roughly 60% of the blocks still holding the original 49 rows, but the rest of the blocks in the table have suffered anything from virtually no deletions to complete emptying (if you check the total number of blocks reported against the total in the first report you’ll notice that there must be a few blocks (24) which are now completely empty). How many of those blocks are now available for insert? Here’s a quick calculation: we had 49 rows in most of our blocks, which would have been filled to 90% (default pctfree = 10), so a block will drop to the 75% mark (which is when ASSM will flag it as having free space) when it has fewer than 41 rows in it (49 * 75 / 90): of our 204,000 blocks roughly 75,000 match that criterion (checking the “Blocks Running Total” column). In practical terms this means that if we shrink the table the avg_data_blocks_per_key will increase significantly, and if we leave the table as is and allow new data to be inserted into the free space then the avg_data_blocks_per_key will also increase significantly.
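
Rather than doing the arithmetic you can also ask the database directly: for a segment in an ASSM tablespace the dbms_space.space_usage() procedure reports how many blocks fall into each of the four “percentage free” bands that ASSM tracks. Here’s a minimal sketch for this table (the variable names are mine, and you need serveroutput enabled to see the results):

set serveroutput on

declare
        m_unformatted_blocks    number;
        m_unformatted_bytes     number;
        m_fs1_blocks            number;         -- 0% to 25% free space
        m_fs1_bytes             number;
        m_fs2_blocks            number;         -- 25% to 50% free space
        m_fs2_bytes             number;
        m_fs3_blocks            number;         -- 50% to 75% free space
        m_fs3_bytes             number;
        m_fs4_blocks            number;         -- 75% to 100% free space
        m_fs4_bytes             number;
        m_full_blocks           number;
        m_full_bytes            number;
begin
        dbms_space.space_usage(
                segment_owner           => user,
                segment_name            => 'T1',
                segment_type            => 'TABLE',
                unformatted_blocks      => m_unformatted_blocks,
                unformatted_bytes       => m_unformatted_bytes,
                fs1_blocks              => m_fs1_blocks,
                fs1_bytes               => m_fs1_bytes,
                fs2_blocks              => m_fs2_blocks,
                fs2_bytes               => m_fs2_bytes,
                fs3_blocks              => m_fs3_blocks,
                fs3_bytes               => m_fs3_bytes,
                fs4_blocks              => m_fs4_blocks,
                fs4_bytes               => m_fs4_bytes,
                full_blocks             => m_full_blocks,
                full_bytes              => m_full_bytes
        );
        dbms_output.put_line('Full blocks          : ' || m_full_blocks);
        dbms_output.put_line('75% - 100% free      : ' || m_fs4_blocks);
        dbms_output.put_line('50% -  75% free      : ' || m_fs3_blocks);
        dbms_output.put_line('25% -  50% free      : ' || m_fs2_blocks);
        dbms_output.put_line(' 0% -  25% free      : ' || m_fs1_blocks);
end;
/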

Index Space

The previous section showed some simple SQL that gave you an idea of how space would appear (or data would remain) in the table – can we do something similar for indexes? Inevitably the answer is yes, but the code that answers the question “what would this index look like after I deleted data matching predicate X” is a more expensive piece of code to run than the equivalent for tables. To start with here’s a simple piece of code to check just the current content of an index:

select
        rows_per_leaf, count(*) leaf_blocks
from    (
        select
                /*+ index_ffs(t1(client_ref)) */
                sys_op_lbid(94255, 'L', t1.rowid)       leaf_block,
                count(*)                                rows_per_leaf
        from
                t1
        where
                client_ref is not null
        group by
                sys_op_lbid(94255, 'L', t1.rowid)
        )
group by
        rows_per_leaf
order by
        rows_per_leaf
;

The call to sys_op_lbid() takes a table rowid as one of its inputs and returns something that looks like the rowid for the first row of a block, and that block’s address is the address of the index leaf block that holds the index entry for the supplied table rowid. The other two parameters are the object_id of the index (or partition/subpartition if the index is partitioned), and a flag which identifies the specific use of the function – an ‘L’ in our case. The hint to use an index fast full scan on the target index is necessary – any other path can return the wrong result – and the “client_ref is not null” is necessary to ensure that the query can actually use the index_ffs path legally.
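
The 94255 in my query is simply the object_id of the client_ref index on my system; you would have to substitute the value from your own data dictionary, found with something like:

select
        object_id
from
        user_objects
where
        object_name = 'T1_CLIENT'
and     object_type = 'INDEX'
;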

For my initial data set the index had 448 index entries in every block except one (presumably the last, at 192 rows). As you can see, even this simple query has to be crafted to meet the requirements of each index – and because an index fast full scan is needed to get the correct result we have to do something even more unusual to see how our massive delete would affect the index. Here’s a sample to show how we would find out what effect an attempt to delete the rows opened more than five years ago would have on the client_ref index:

select
        rows_per_leaf,
        count(*)                                    	blocks,
        rows_per_leaf * count(*)                    	row_count,
        sum(count(*)) over (order by rows_per_leaf)                 running_blocks,
        sum(rows_per_leaf * count(*)) over (order by rows_per_leaf) running_rows
from    (
        select
                /*+ leading(v1 t1) use_hash(t1) */
                leaf_block, count(*) rows_per_leaf
        from    (
                select
                        /*+ no_merge index_ffs(t1(client_ref)) */
                        sys_op_lbid(94255, 'L', t1.rowid)       leaf_block,
                        t1.rowid                                rid
                from
                        t1
                where
                        client_ref is not null
                )       v1,
                t1
        where
                t1.rowid = v1.rid
        and     date_open <  add_months(sysdate, -60)
        group by
                leaf_block
        )
group by
        rows_per_leaf
order by
        rows_per_leaf
;

As you can see, we start off with an inline (hinted, non-mergeable) view to attach the index leaf block id to every single table rowid, then we join that set of rowids back to the table – joining by rowid and forcing a hash join. I’ve hinted the hash join because it’s (probably) the most efficient strategy, but although I’ve put in a leading() hint I haven’t included a hint about swapping (or not) the join inputs – I’m going to let the optimizer decide which of the two data sets is the smaller and therefore the more appropriate one for building the hash table.

In this particular case the optimizer was able to use an index-only access path to find all the rowids for the rows where date_open was earlier than 5 years ago. Even so (partly because my pga_aggregate_target was relatively small and the hash join spilled to (solid state) disc), the query took 3 minutes and 15 seconds to complete, compared to the 1.5 seconds for the previous query, which I happened to run while the entire index was cached. Here’s an extract of the output:

                                             Blocks           Rows
Rows_per_leaf   Blocks           Rows Running total  Running total
------------- -------- -------------- ------------- --------------
          181        2            362             3            458
          186        2            372             5            830
          187        2            374             7          1,204
          188        1            188             8          1,392
...
          210      346         72,660         2,312        474,882
          211      401         84,611         2,713        559,493
...
          221      808        178,568         8,989      1,921,410
          222      851        188,922         9,840      2,110,332
          223      832        185,536        10,672      2,295,868
...
          242      216         52,272        21,320      4,756,575
          243      173         42,039        21,493      4,798,614
          244      156         38,064        21,649      4,836,678
...
          265        1            265        22,321      5,003,718
          266        1            266        22,322      5,003,984

We’re going to modify 22,322 leaf blocks – that’s every single leaf block in the index; and the number of rows we delete from a leaf block varies from just one to 266. I’ve selected a few rows at a time from the 83 lines of output, but you can probably still see that the pattern seems to follow the normal distribution, centered around the 222 (50%) mark. If we do that delete it should be clear that we’re going to expend a lot of effort updating this index; even then the simple count of “how many rows deleted per leaf block” doesn’t tell us the whole story about the work we’ll do – we don’t know whether we will (for example) delete all 266 index entries at the same time from the last block reported above, or whether we’ll be jumping around the index more or less at random and find ourselves continually revisiting that block to delete one index entry at a time (we could write some SQL to get some idea of that aspect of the problem, and it’s not particularly difficult to do, but I’ll leave that as an exercise for the reader). The difference in workload could be dramatic and might be enough to make us drop (or mark unusable) and rebuild some indexes rather than maintaining them during the delete. So in the next installment we’ll look at which aspects of the workload we need to consider, and how different deletion strategies can have a significant impact on that workload.
