In Part 1 of this short series I provided an informal description of a couple of scenarios where we might do a large-scale delete from a table. Without a concrete example though, it can be hard to imagine how the nature of the data deleted and the access paths available could affect the impact a large delete operation could have on your system, so I’m going to spend most of this article talking about a couple of tests on a generated data set. The article may seem a bit long but quite a lot of the space will be taken up by tabular reports.
Sample Data
With the ever-increasing power and scale of hardware it becomes harder and harder to agree on what we might mean by a “large table” or “massive delete” – to one person 1 million rows might seem large, to another 100 million might seem fairly ordinary. I’m going to go with a compromise of 10 million rows, representing an investment system that has been growing by 1M rows per year for 10 years and has reached a segment size of 1.6GB. This table is, of course, just one of several making up the full system and there will at some point be related data to worry about but, for the present, we’re going to isolate just this table and consider only the table itself and the 4 indexes on it. Here’s the code to generate the data set:
execute dbms_random.seed(0)

create table t1 (
        id              not null,
        date_open,
        date_closed,
        deal_type,
        client_ref,
        small_vc,
        padding
)
nologging
as
with generator as (
        select  /*+ materialize cardinality(1e4) */
                rownum id
        from    dual
        connect by
                rownum <= 1e4
)
select
        1e4 * (g1.id - 1) + g2.id                       id,
        trunc(
                add_months(sysdate, -120) +
                (1e4 * (g1.id - 1) + g2.id) * 3652 / 1e7
        )                                               date_open,
        trunc(
                add_months(
                        add_months(sysdate, -120) +
                        (1e4 * (g1.id - 1) + g2.id) * 3652 / 1e7,
                        12 * trunc(dbms_random.value(1,6))
                )
        )                                               date_closed,
        cast(dbms_random.string('U',1) as varchar2(1))  deal_type,
        cast(dbms_random.string('U',4) as varchar2(4))  client_ref,
        lpad(1e4 * (g1.id - 1) + g2.id,10)              small_vc,
        rpad('x',100,'x')                               padding
from
        generator       g1,
        generator       g2
where
        g1.id <= 1e3
and     g2.id <= 1e4
;

execute dbms_stats.gather_table_stats(user,'t1',method_opt=>'for all columns size 1')

alter table t1 add constraint t1_pk primary key(id) using index nologging;

create index t1_dt_open   on t1(date_open)   nologging;
create index t1_dt_closed on t1(date_closed) nologging;
create index t1_client    on t1(client_ref)  nologging;
It’s not immediately obvious, but the code generates 10 million rows; the date_open starts 120 months (10 years, or 3,652 days) in the past and the arithmetic used to increment the value means the most recent entry is at the current date. The date_closed is an integer number of years between 1 and 5 (inclusive) added to the date_open (the table is a simple-minded model of recording fixed-term investments). The deal_type is a randomly generated single uppercase character, producing 26 distinct values with roughly equal volumes of data; the client_ref is randomly generated as a fixed-length string of 4 uppercase characters, giving about half a million possible combinations and roughly 20 rows per combination.
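If you want to check those claims for yourself, a quick aggregate query (just a sanity-check sketch, nothing more) will confirm the date range, the number of distinct deal types and client references, and the range of investment terms:

select
        count(*)                                        row_count,
        count(distinct deal_type)                       deal_types,
        count(distinct client_ref)                      client_refs,
        min(date_open)                                  first_open,
        max(date_open)                                  last_open,
        min(months_between(date_closed, date_open))/12  min_term_years,
        max(months_between(date_closed, date_open))/12  max_term_years
from
        t1
;

You should expect to see 10 million rows, 26 deal types, close to 456,976 (i.e. 26^4) distinct client references, and terms of 1 to 5 years.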
As a side note – I’ve generated the data set without using rownum anywhere in the high-volume select; this would make it possible for me to use parallel execution to generate the data more quickly (both the “level” and the “rownum” pseudo columns limit Oracle’s ability to use parallel execution). In this case though, because I want the id column to model a sequentially generated value being stored in order of arrival, I’m running the code serially.
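Just to illustrate the point, a parallel version of the CTAS might look something like the following sketch – cut down to one million rows, a couple of columns and a scratch table name purely to show the shape, and with the obvious side effect that the ids would no longer arrive in anything like sequential order:

alter session enable parallel ddl;

create table t1_parallel_demo           -- scratch table, 1M rows only
parallel 4
nologging
as
with generator as (
        select  /*+ materialize cardinality(1e4) */
                rownum id
        from    dual
        connect by rownum <= 1e4
)
select  /*+ parallel(4) */
        1e4 * (g1.id - 1) + g2.id       id,
        rpad('x',100,'x')               padding
from
        generator       g1,
        generator       g2
where
        g1.id <= 1e2
and     g2.id <= 1e4
;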
Scale
On my laptop, running 12.1.0.2 on a Linux 5 VM I got the following times for creating the data, gathering stats and creating the indexes:
Table creation:  7:06.40
Stats gathered:  0:10.54
PK added:        0:10.94
Index created:   0:10.79 (date_open)
Index created:   0:12.17 (date_closed)
Index created:   0:13.65 (client_ref)
This, of course, is where we have to start asking questions about realism and how different systems might behave. The virtual machine had 4GB of memory allocated (of which 1.6GB was set aside for the memory_target), and a nominal 2 CPUs running at 2.8GHz from a quad core CPU – but possibly most significant was the fact that the machine has 1TB of solid state disc, so doesn’t lose much time on physical I/O. The database was configured with 3 redo log groups of 200MB each (to encourage some delays for log file checkpoint and log file switch waits), the logs were duplexed but the instance wasn’t running in archivelog mode. After stats collection the table block count was roughly 204,000 blocks with 49 rows per block in most blocks, the PK and client indexes had about 22,000 leaf blocks and the two date indexes about 26,500 leaf blocks.
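Those block and leaf block figures came straight from the data dictionary after gathering stats; if you want to reproduce them, a couple of simple queries against user_tables and user_indexes are all it takes:

select  blocks, num_rows, round(num_rows / blocks) rows_per_block
from    user_tables
where   table_name = 'T1'
;

select  index_name, blevel, leaf_blocks, clustering_factor
from    user_indexes
where   table_name = 'T1'
order by index_name
;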
Quality
It’s important when trying to use models like this to question how close to reality they get: so how many flaws can you spot in what I’ve done so far? First of all, the id column is too perfect – ids appear in perfect sequence in the table, while in real life with concurrent single-row inserts there would be a little ‘jitter’, with ranges of consecutive values being scattered over a small number of blocks; this is probably not terribly important. More significantly, I’ve created the indexes after inserting all the data, which means the indexes are physically as perfect as they could be (and with 10% free space per leaf block). I really ought to have created the table and indexes empty, then run several concurrent scripts that did single-row inserts using a sequence to generate ids – but the last time I did that sort of thing the run time jumped by a factor of 40 (so that’s around 5 hours per run, and I want to drop and recreate this data several times). Again that’s not terribly significant, though I ought to remember that on a live system the average available space in index leaf blocks might be closer to 30% at any moment, with significant variation from block to block. On a production system I might also want to check the state of the leaf blocks of the date-based indexes from time to time – the date_open index in particular.
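If you do want to keep an eye on leaf block usage, one simple (if rather heavy-handed) option is to validate an index and read the summary row that appears in index_stats. This is only a sketch of the idea – remember that this form of the command locks the underlying table while it runs, and that the index_stats row is visible only to the current session:

analyze index t1_dt_open validate structure;

select
        height, lf_blks, lf_rows, del_lf_rows,
        pct_used                -- average percentage of space used per block
from
        index_stats
;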
Scenarios
Despite the fact that any timing is highly dependent on machine configuration and resources, and that the model is over-simplified, we can still get some interesting information from a few basic tests. Let’s start with a few scenarios relating to business-based decisions:
- (a) Delete all deals that closed more than 5 years ago
- (b) Delete all the deals where the client_ref starts with ‘A’ – ‘E’
- (c) Delete all deals that opened more than 5 years ago
We could imagine that option (a) is a basic archiving requirement – maybe the data has been copied to another table before deletion. Option (b) perhaps tells us that the client_ref has been (ab)used to encode some significant classification in the first letter of the reference and we’re splitting the data into two processing sets. Option (c) could be part of a process aimed at partitioning the data by date_open (though I’m not sure that it looks like a good way to go about partitioning in this case). Before doing anything that’s likely to be expensive for an Oracle database it’s always a good idea to see if you can visualize what Oracle will have to do, what the execution steps will be, and where the workload will appear. Are these scenarios all the same and, if not, how do they differ? If you don’t know your data and the impact of your delete you can always ask the database – for example:
select
        rows_in_block,
        count(*)                                        blocks,
        rows_in_block * count(*)                        row_count,
        sum(count(*)) over (order by rows_in_block)     running_blocks,
        sum(rows_in_block * count(*)) over (order by rows_in_block)    running_rows
from
        (
        select
                dbms_rowid.rowid_relative_fno(rowid),
                dbms_rowid.rowid_block_number(rowid),
                count(*)                                rows_in_block
        from
                t1
--
--      where   date_open >= add_months(sysdate, -60)
--      where   date_open <  add_months(sysdate, -60)
--
--      where   date_closed >= add_months(sysdate, -60)
--      where   date_closed <  add_months(sysdate, -60)
--
--      where   substr(client_ref,2,1) >= 'F'
--      where   substr(client_ref,2,1) <  'F'
--
        group by
                dbms_rowid.rowid_relative_fno(rowid),
                dbms_rowid.rowid_block_number(rowid)
        )
group by
        rows_in_block
order by
        rows_in_block
;
You’ll notice that I’ve got six commented predicates (in three complementary pairs) in this query. The query is basically designed to give me a summary of how many blocks hold how many rows, but each pair of predicates gives me some idea of the effect of one of my scenarios – one predicate from each pair tells me something about the volume and pattern of data I’ll be deleting, while the other tells me about the volume and pattern of data that would survive the delete. The former could give you some idea of how many different blocks you will have to modify and how much work that will take, while the latter would help you understand how the system might behave afterwards. With a little bit of SQL*Plus formatting here’s the output after creating the data:
                                            Blocks           Rows
Rows per block   Blocks         Rows Running total  Running total
-------------- -------- ------------ ------------- --------------
            27        1           27             1             27
            49  203,877    9,989,973       203,878      9,990,000
            50      200       10,000       204,078     10,000,000
               --------
sum             204,078
And here’s the output showing what the surviving data would look like if we delete all rows that opened more than 5 years ago (i.e. use the predicate date_open >= add_months(sysdate, -60)).
                                            Blocks           Rows
Rows per block   Blocks         Rows Running total  Running total
-------------- -------- ------------ ------------- --------------
            27        1           27             1             27
            42        1           42             2             69
            49  102,014    4,998,686       102,016      4,998,755
               --------
sum             102,016
That’s rather nice – roughly speaking we’ve emptied out half the blocks in the table and left the other half untouched. If we tried a “shrink space” now we would simply be copying the rows from the second half of the table to the first half – we’d generate a huge volume of undo and redo, but the clustering factor (or, more specifically, the avg_data_blocks_per_key representation of the clustering_factor) of any index would probably be pretty much unchanged. Alternatively, if we decided to leave the empty space as it is, any new data would simply start filling that space very efficiently (almost as if it were using newly allocated extents) from the start of the table – again we’d see the clustering factor (i.e. avg_data_blocks_per_key) of any index pretty much unchanged. Compare this with the consequences of deleting all the rows that closed more than 5 years ago (i.e. what we would see left if we used the predicate date_closed >= add_months(sysdate, -60)) – the report is rather longer:
                                            Blocks           Rows
Rows per block   Blocks         Rows Running total  Running total
-------------- -------- ------------ ------------- --------------
             1        5            5             5              5
             2       22           44            27             49
             3      113          339           140            388
             4      281        1,124           421          1,512
             5      680        3,400         1,101          4,912
             6    1,256        7,536         2,357         12,448
             7    1,856       12,992         4,213         25,440
             8    2,508       20,064         6,721         45,504
             9    2,875       25,875         9,596         71,379
            10    2,961       29,610        12,557        100,989
            11    2,621       28,831        15,178        129,820
            12    2,222       26,664        17,400        156,484
            13    1,812       23,556        19,212        180,040
            14    1,550       21,700        20,762        201,740
            15    1,543       23,145        22,305        224,885
            16    1,611       25,776        23,916        250,661
            17    1,976       33,592        25,892        284,253
            18    2,168       39,024        28,060        323,277
            19    2,416       45,904        30,476        369,181
            20    2,317       46,340        32,793        415,521
            21    2,310       48,510        35,103        464,031
            22    2,080       45,760        37,183        509,791
            23    1,833       42,159        39,016        551,950
            24    1,696       40,704        40,712        592,654
            25    1,769       44,225        42,481        636,879
            26    1,799       46,774        44,280        683,653
            27    2,138       57,726        46,418        741,379
            28    2,251       63,028        48,669        804,407
            29    2,448       70,992        51,117        875,399
            30    2,339       70,170        53,456        945,569
            31    2,286       70,866        55,742      1,016,435
            32    1,864       59,648        57,606      1,076,083
            33    1,704       56,232        59,310      1,132,315
            34    1,566       53,244        60,876      1,185,559
            35    1,556       54,460        62,432      1,240,019
            36    1,850       66,600        64,282      1,306,619
            37    2,131       78,847        66,413      1,385,466
            38    2,583       98,154        68,996      1,483,620
            39    2,966      115,674        71,962      1,599,294
            40    2,891      115,640        74,853      1,714,934
            41    2,441      100,081        77,294      1,815,015
            42    1,932       81,144        79,226      1,896,159
            43    1,300       55,900        80,526      1,952,059
            44      683       30,052        81,209      1,982,111
            45      291       13,095        81,500      1,995,206
            46      107        4,922        81,607      2,000,128
            47       32        1,504        81,639      2,001,632
            48        3          144        81,642      2,001,776
            49  122,412    5,998,188       204,054      7,999,964
               --------
sum             204,054
In this case we have roughly 60% of the blocks still holding the original 49 rows, but the rest of the blocks in the table have suffered anything from virtually no deletions to complete emptying (if you check the total number of blocks reported against the total in the first report you’ll notice that there must be a few blocks (24) which are now completely empty). How many of those blocks are now available for insert? Here’s a quick calculation: we had 49 rows in most of our blocks, which would have been filled to 90% (default pctfree = 10), so a block will drop below the 75% mark (which is when ASSM will flag it as having free space) when it has fewer than 41 rows in it (49 * 75 / 90 is roughly 40.8): of our 204,000 blocks roughly 75,000 match that criterion (checking the “Blocks Running Total” column). In practical terms this means that if we shrink the table the avg_data_blocks_per_key will increase significantly, and if we leave the table as it is and allow new data to be inserted into the free space then the avg_data_blocks_per_key will also increase significantly.
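If you want Oracle’s own opinion of how many blocks have dropped below the various ASSM free space thresholds after a delete, rather than inferring it from the rowid-based report, a call to dbms_space.space_usage will report the number of blocks in each free space bucket. Here’s a minimal sketch, assuming the table is in an ASSM tablespace and serveroutput is enabled:

declare
        m_unformatted_blocks    number;
        m_unformatted_bytes     number;
        m_fs1_blocks            number;         -- 0%  - 25%  free space
        m_fs1_bytes             number;
        m_fs2_blocks            number;         -- 25% - 50%  free space
        m_fs2_bytes             number;
        m_fs3_blocks            number;         -- 50% - 75%  free space
        m_fs3_bytes             number;
        m_fs4_blocks            number;         -- 75% - 100% free space
        m_fs4_bytes             number;
        m_full_blocks           number;
        m_full_bytes            number;
begin
        dbms_space.space_usage(
                segment_owner           => user,
                segment_name            => 'T1',
                segment_type            => 'TABLE',
                unformatted_blocks      => m_unformatted_blocks,
                unformatted_bytes       => m_unformatted_bytes,
                fs1_blocks              => m_fs1_blocks,
                fs1_bytes               => m_fs1_bytes,
                fs2_blocks              => m_fs2_blocks,
                fs2_bytes               => m_fs2_bytes,
                fs3_blocks              => m_fs3_blocks,
                fs3_bytes               => m_fs3_bytes,
                fs4_blocks              => m_fs4_blocks,
                fs4_bytes               => m_fs4_bytes,
                full_blocks             => m_full_blocks,
                full_bytes              => m_full_bytes
        );

        dbms_output.put_line('Full blocks          : ' || m_full_blocks);
        dbms_output.put_line('0%  -  25% free (fs1): ' || m_fs1_blocks);
        dbms_output.put_line('25% -  50% free (fs2): ' || m_fs2_blocks);
        dbms_output.put_line('50% -  75% free (fs3): ' || m_fs3_blocks);
        dbms_output.put_line('75% - 100% free (fs4): ' || m_fs4_blocks);
end;
/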
Index Space
The previous section showed some simple SQL that gave you an idea of how space would appear (or data would remain) in the table – can we do something similar for indexes? Inevitably the answer is yes, but the code that answers the question “what would this index look like after I deleted data matching predicate X?” is rather more expensive to run than the equivalent for tables. To start with, here’s a simple piece of code to check just the current content of an index:
select
        rows_per_leaf,
        count(*)        leaf_blocks
from
        (
        select
                /*+ index_ffs(t1(client_ref)) */
                sys_op_lbid(94255, 'L', t1.rowid)       leaf_block,
                count(*)                                rows_per_leaf
        from
                t1
        where
                client_ref is not null
        group by
                sys_op_lbid(94255, 'L', t1.rowid)
        )
group by
        rows_per_leaf
order by
        rows_per_leaf
;
The call to sys_op_lbid() takes a table rowid as one of its inputs and returns something that looks like the rowid for the first row of a block, and that block’s address is the address of the index leaf block that holds the index entry for the supplied table rowid. The other two parameters are the object_id of the index (or partition/subpartition if the index is partitioned), and a flag which identifies the specific use of the function – an ‘L’ in our case. The hint to use an index fast full scan on the target index is necessary – any other path can return the wrong result – and the “client_ref is not null” is necessary to ensure that the query can actually use the index_ffs path legally.
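As an aside, the 94255 in my query is simply the object_id that the t1_client index happened to acquire on my system; it will be different on yours, so you would look it up first with something like:

select  object_id
from    user_objects
where   object_name = 'T1_CLIENT'
and     object_type = 'INDEX'
;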
For my initial data set the index had 448 index entries in every block except one (presumably the last, at 192 rows). As you can see, even this simple query has to be crafted to meet the requirements of each index – and because an index fast full scan is needed to get the correct result we have to do something even more unusual to see how our massive delete would affect the index. Here’s a sample to show how we would find out what effect an attempt to delete the rows opened more than five years ago would have on the client_ref index:
select
        rows_per_leaf,
        count(*)                                        blocks,
        rows_per_leaf * count(*)                        row_count,
        sum(count(*)) over (order by rows_per_leaf)     running_blocks,
        sum(rows_per_leaf * count(*)) over (order by rows_per_leaf)    running_rows
from
        (
        select
                /*+ leading(v1 t1) use_hash(t1) */
                leaf_block,
                count(*)        rows_per_leaf
        from
                (
                select
                        /*+ no_merge index_ffs(t1(client_ref)) */
                        sys_op_lbid(94255, 'L', t1.rowid)       leaf_block,
                        t1.rowid                                rid
                from
                        t1
                where
                        client_ref is not null
                )       v1,
                t1
        where
                t1.rowid = v1.rid
        and     date_open < add_months(sysdate, -60)
        group by
                leaf_block
        )
group by
        rows_per_leaf
order by
        rows_per_leaf
;
As you can see we start off with an inline (hinted, non-mergeable) view to attach the index leaf block id to every single table rowid, then we join that set of rowids back to the table – joining by rowid and forcing a hash join. I’ve hinted the hash join because it’s (probably) the most efficient strategy, but although I’ve put in a leading() hint I haven’t included a hint about swapping (or not) the join inputs – I’m going to let the optimizer decide which of the two data sets is the smaller and therefore more appropriate for building the hash table.
In this particular case the optimizer was able to use an index-only access path to find all the rowids for the rows where date_open was earlier than 5 years ago. Even so (partly because my pga_aggregate_target was relatively small and the hash join spilled to (solid state) disc), the query took 3 minutes and 15 seconds to complete, compared to 1.5 seconds for the previous query, which I happened to run while the entire index was cached. Here’s an extract of the output:
                                           Blocks           Rows
Rows_per_leaf   Blocks         Rows Running total  Running total
------------- -------- ------------ ------------- --------------
          181        2          362             3            458
          186        2          372             5            830
          187        2          374             7          1,204
          188        1          188             8          1,392
...
          210      346       72,660         2,312        474,882
          211      401       84,611         2,713        559,493
...
          221      808      178,568         8,989      1,921,410
          222      851      188,922         9,840      2,110,332
          223      832      185,536        10,672      2,295,868
...
          242      216       52,272        21,320      4,756,575
          243      173       42,039        21,493      4,798,614
          244      156       38,064        21,649      4,836,678
...
          265        1          265        22,321      5,003,718
          266        1          266        22,322      5,003,984
We’re going to modify 22,322 leaf blocks – that’s every single leaf block in the index – and the number of rows we delete from a leaf block varies from just one to 266. I’ve selected a few rows at a time from the 83 lines of output, but you can probably still see that the pattern seems to follow the normal distribution, centered around the 222 (50%) mark. If we do that delete it should be clear that we’re going to expend a lot of effort updating this index; even then the simple count of “how many rows deleted per leaf block” doesn’t tell us the whole story about the work we’ll do – we don’t know whether we will (for example) delete all 266 index entries from the last block reported above at the same time, or whether we’ll be jumping around the index extremely randomly and find ourselves continuously revisiting that block to delete one index entry at a time (we could write some SQL to get some idea of that aspect of the problem, and it’s not particularly difficult to do, but I’ll leave that as an exercise for the reader). The difference in workload could be dramatic, and may be enough to make us drop (or mark unusable) and rebuild some indexes rather than maintaining them during the delete. So in the next installment we’ll look at what aspects of the workload we need to consider, and how different deletion strategies can have a significant impact on that workload.
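In the meantime, to give a flavour of the sort of strategy I mean, here’s a minimal sketch of the “disable and rebuild” idea – only a sketch, not a recipe, and note that I’ve left the primary key index alone since you can’t side-step the maintenance of a unique index in the same way:

alter index t1_client    unusable;
alter index t1_dt_closed unusable;

-- skip_unusable_indexes defaults to true (from 10g onwards), so the
-- delete no longer has to maintain the two unusable (non-unique) indexes
delete from t1 where date_open < add_months(sysdate, -60);
commit;

alter index t1_client    rebuild nologging;
alter index t1_dt_closed rebuild nologging;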