05 July 2011

24858 views

24 0

Arrays in SQL that Avoid Repeated Groups

It is certainly possible to fake an Array in SQL, but there are only a few occasions when it would be the best design. Most often, the wish for an array in SQL is a sign of a forlorn struggle against poorly-normalised data. One of the worst sins against Codd is the repeating group, as Joe explains.

For a long time, procedural languages have used arrays to hold and manipulate data, so programmers are used to designing data with them. The relational model states that all data is shown in tables, and we have no other data structures. To be in First Normal Form (1NF) a table can have no repeating groups, which mean that it has only columns of scalar values. You can make a good case that you can only declare a table in the 1NF in SQL. But that does not stop programmers from finding ways to “fake it” and make their SQL look like their familiar application languages.

In procedural languages, arrays are stored in “row major” or “column major” order. Go to Wikipedia (Row-major order) for details, but the important concept is how to take an n-dimensional array and “flatten” it into sequential physical storage. This is important for procedural languages because it determines the algorithms used to access the array elements. In SQL, we do not care about any physical storage.

It is quite common to see repeated groups used in SQL to attempt to simulate an Array. An example of a repeated group is an employee table with a set of columns for dependents, as follows:

CREATE TABLE Personnel

(emp_id INTEGER NOT NULL PRIMARY KEY,

emp_name VARCHAR(25) NOT NULL,

kid_name_1 VARCHAR(25),

kid_name_2 VARCHAR(25),

kid_name_3 VARCHAR(25),

kid_name_4 VARCHAR(25),

kid_name_5 VARCHAR(25),

..);

This table has many problems. All the columns in the repeating group must be NULL-able, otherwise you cannot represent employees with zero, one, two, three, or four dependent children. The alternative of requiring all employees to have exactly five kids does not work. Likewise, the first employee that has six children messes up the table. And you cannot require him to ditch one of his dependents.

If “number one son” dies, you have to decide to leave his slot NULL or to move the others up one notch. Where would you put the name of a new daughter in the group? Position in a repeating group often has meaning in itself, so you cannot use a single simple UPDATE, INSERT, or DELETE procedure for such tables.

Queries are much more difficult. As an exercise, try to write queries against the Personnel table to find:

All kids named “George” by the same employee. George Foreman should be the only answer to this one.
All employees with three or more offspring
All employees with a kid who has the same name as they do (find the juniors)
Pairs of employees whose children have the same names

You are unlikely to enjoy this task. Using children makes the silliness easy to see.

Representing Arrays in SQL

SQL cannot represent arrays directly, but vendors often provide array language extensions. Two methods for supporting arrays are to have columns with “array” data types (as a whole) or to allow the referencing of groups of columns by subscript (element by element). Subscripts are also called array indexes, but that term can be confused with table indexes in SQL, so I use the term “subscript” instead.

An array in other programming languages has a name and subscripts by which you reference the array elements. Typically, the array elements all have the same data type and the subscripts are all integers. Some languages start numbering at zero, some at one, and others let the user set the upper and lower bounds. A Pascal array declaration, for example, would look like this:

1	MyArray : ARRAY [1..5] OF INTEGER;

This would have elements MyArray[1], MyArray[2], MyArray[3], MyArray[4], and MyArray[5]. The same structure is often mapped into a SQL declaration as:

CREATE TABLE MyArray1

(element1 INTEGER NOT NULL,

element2 INTEGER NOT NULL,

element3 INTEGER NOT NULL,

element4 INTEGER NOT NULL,

element5 INTEGER NOT NULL);

You have to go to all this trouble because there is no subscript that you can iterate in a loop. In fact, there is no loop control structure at all! You must use column names.

A better alternative approach to faking an array in the relational model is to map arrays into a table with an integer column for each subscript, as follows:

CREATE TABLE MyArray2

(i INTEGER NOT NULL CHECK (i BETWEEN 1 AND 5),

element INTEGER NOT NULL,

PRIMARY KEY (i));

This looks more complex than the first approach, but it is closer to what the original Pascal, or any other procedural language, declaration does behind the scenes. Subscripts resolve to unique physical addresses, so it is not possible to have two values for MyArray[i]; hence i is a key. The compiler will check to see that the subscripts are within the declared range using the CHECK() clause.

The first advantage of this approach is that you can easily handle multidimensional arrays by adding another column for each subscript. The Pascal declaration:

1	ThreeD : ARRAY [1..3, 1..4, 1..5] OF REAL;

becomes:

CREATE TABLE ThreeD

(i INTEGER NOT NULL CHECK (i BETWEEN 1 AND 3),

j INTEGER NOT NULL CHECK (j BETWEEN 1 AND 4),

k INTEGER NOT NULL CHECK (k BETWEEN 1 AND 5),

element INTEGER NOT NULL,

PRIMARY KEY (i, j, k));

Obviously, GROUP BY clauses on the subscript columns will produce row and column totals. If you used the original one-element/one-column approach, the table declaration would have 120 columns, named “element111” to “element345.” This would be too many names to handle in any reasonable way.

This idiom can support matrix math, but I am not going to go into that in this article. If anyone is interested, try to write the following operations for a classic two dimensional matrix:

Matrix equality
Matrix addition
Matrix multiplications
Matrix sorting
Compute a determinant

Let’s go back the Personnel table and declare it using this approach:

CREATE TABLE Personnel

(emp_id INTEGER NOT NULL PRIMARY KEY,

emp_name VARCHAR(25) NOT NULL,

kid_name VARCHAR(25) NOT NULL,

birth_seq INTEGER NOT NULL, --subscript with a good name

..);

Actually, you should normalize this table further by splitting it into Personnel and Dependents tables. The Dependents table needs its own constraints and references back to the Personnel table. But the way to think about it is that you are doing explicitly what has been done implicitly.

CREATE TABLE Personnel

(emp_id INTEGER NOT NULL PRIMARY KEY,

emp_name VARCHAR(25) NOT NULL,

..);

CREATE TABLE Dependents

(emp_id INTEGER NOT NULL

REFERENCES Personnel(emp_id)

ON DELETE CASCADE

ON UPDATE CASCADE,

birth_seq INTEGER NOT NULL

CHECK (birth_seq > 0),

PRIMARY KEY (emp_id, birth_seq),

kid_name VARCHAR(25) NOT NULL,);

The four proposed queries are now simple and will work for families of any size.

All kids named “George”

1
2
3

SELECT kid_name
FROM Dependents
WHERE kid_name = 'George';
All Personnel with three or more offspring

1
2
3
4
5

SELECT P.emp_name, COUNT(*) AS kid_cnt
FROM Dependents AS D, Personnel AS P
WHERE D.emp_id = P.emp_id
GROUP BY P.emp_name
HAVING COUNT(*) >= 3;
All Personnel with a kid who has the same name as they do ( find the “juniors”). George Foreman will get several lines in the output…

1
2
3
4

SELECT P.emp_name AS junio, D.birth_seq
FROM Dependents AS D, Personnel AS P
WHERE D.emp_id = P.emp_id
AND D.kid_name = P.emp_name;

The query “find pairs of employees whose children all have the same names” is very restrictive. Both Mr. X and Mr. Y must have exactly the same number of dependents, and both sets of names must match and be in the same birth_seq. (We can assume that nobody has two children with the same name — except George Foreman.) Begin by constructing a table of sample data:

INSERT INTO Dependents (emp_id, kid_name, birth_seq)

VALUES (1, 'Dick', 2), (1, 'Harry', 3), (1, 'Tom', 1),

(2, 'Dick', 3), (2, 'Harry', 1), (2, 'Tom', 2),

(3, 'Dick', 2), (3, 'Harry', 3), (3, 'Tom', 1),

(4, 'Harry', 1), (4, 'Tom', 2), (5, 'Curly', 2),

(5, 'Harry', 3), (5, 'Moe', 1):

In this test data, employees 1, 2, and 3 all have kids named Tom, Dick, and Harry. The birth order is the same for the children of employee 1 and 3. For testing purposes, you might consider adding an extra child to the family of employee 3, and so on.

While there are many ways to do this query, this approach gives you some flexibility that others do not. Construct a VIEW that gives you the number of each employee’s dependents:

CREATE VIEW Familysize (emp_id, dependent_cnt)

SELECT emp_id, COUNT(*)

FROM Dependents

GROUP BY emp_id;

Create a second VIEW that holds pairs of Personnel who have families of the same size:

CREATE VIEW Samesize_Families (emp_id1, emp_id2, dependent_cnt)

SELECT F1.emp_id, F2.emp_id, F1.dependent_cnt

FROM Familysize AS F1, Familysize AS F2

WHERE F1.dependent_cnt = F2.dependent_cnt;

You can test for set equality by doing a self-join on the dependents of Personnel with the same size families. If one set can be mapped onto another with no children left over, then the two sets are equal:

SELECT D1.emp_id AS first_emp_id,

D2.emp_id AS second_emp_id,

S1.dependent_cnt

FROM Dependents AS D1, Dependents AS D2, Samesize_Families AS S1

WHERE S1.emp_id1 = D1.emp_id

AND S1.emp_id2 = D2.emp_id

AND D1.kid_name = D2.kid_name

AND D1.birth_seq = D2.birth_seq

GROUP BY D1.emp_id, D2.emp_id, S1.dependent_cnt

HAVING COUNT(*) = S1.dependent_cnt;

If birth order is not important, then drop the predicate “D1.birth_seq = D2.birth_seq” from the query. Obviously, all of the VIEWs could be done with CTEs.

Flattening a Table into an Array

In a report, you often want to see an array distributed horizontally on a line. The one-element-per-column approach to mapping arrays into SQL was based on seeing such reports and duplicating that structure in a table. Yes, you can use non-relational proprietary extensions like PIVOT today. But you need to know the “subscripts and value” approach so that you can read old code.

Imagine, for example, a company that collects monthly sales reports with each sales representative’s name, the month_name, and each rep’s total dollar figure. You want to produce a report with one line for each person and his or her year’s work across the page. The sales reports table looks like Listing 1:

CREATE TABLE Sales

(salerep_name CHAR(25) NOT NULL,

month_name CHAR(3) NOT NULL, -- the subscript

sales_amt DECIMAL (10,2) NOT NULL, -- the element

PRIMARY KEY (salerep_name, month_name));

You need to flatten out this table to get the desired rows for the report. First create a working storage table from which the report can be built.

CREATE TABLE ReportWork -- working storage

(salerep_name CHAR(25) PRIMARY KEY,

jan DECIMAL(8,2) DEFAULT (0.00) NOT NULL,

feb DECIMAL(8,2) DEFAULT (0.00) NOT NULL,

mar DECIMAL(8,2) DEFAULT (0.00) NOT NULL,

"dec" DECIMAL(8,2) DEFAULT NOT NULL (0.00));

NOTE: DEC is a reserved shorthand word for DECIMAL in Standard SQL, so it has to be double quoted.

Notice that the primary key is the sales rep’s name and that the monthly data columns default to zero dollars. The first step is to get a row for every sales rep in the working table:

INSERT INTO ReportWork (salerep_name)

SELECT emp_name

FROM Personnel

WHERE job_title = 'Salesrep';

Because of the DEFAULT() clause, the other columns will fill with zero amounts. If your SQL does not have a DEFAULT() clause, simply add a dozen constant zeros to the SELECT list.

The data from the Sales table is then added into the working table with a series of UPDATEs of the form:

UPDATE ReportWork AS RW

SET jan = (SELECT sales_amt

FROM Sales AS S1

WHERE S1.month_name = 'Jan'

AND S1.salerep_name = RW.salerep_name);

Or, using newer SQL features:

UPDATE ReportWork AS RW

SET jan = (SELECT sales_amt

FROM Sales AS S1

WHERE S1.month_name = 'Jan'

AND S1.salerep_name = RW.salerep_name);

or using newer SQL features:

MERGE INTO ReportWork AS RW

USING Sales AS S1

ON S1.salerep_name = RW.salerep_name

WHEN MATCHED

THEN UPDATE

SET jan

= CASE WHEN S1.month_name = 'jan' THEN S1.sales_amt ELSE 0.00 END,

feb

= CASE dec S1.month_name = 'feb' THEN S1.sales_amt ELSE 0.00 END,

dec

= CASE WHEN S1.month_name = 'dec' THEN S1.sales_amt ELSE 0.00 END,

This basic technique can be modified to handle NULLs, collect totals, and so forth. The trick is in having a column in the flatten table that matches a value from the 1NF table.

Arrays in SQL that Avoid Repeated Groups

Representing Arrays in SQL

Flattening a Table into an Array

Subscribe for more articles

Rate this article

Joe Celko

Related articles

Database Subsetting and Data Extraction

What’s new in T-SQL in SQL Server 2022

Snake draft sorting in SQL Server, part 2

Tags