Arrays in SQL that Avoid Repeated Groups

It is certainly possible to fake an Array in SQL, but there are only a few occasions when it would be the best design. Most often, the wish for an array in SQL is a sign of a forlorn struggle against poorly-normalised data. One of the worst sins against Codd is the repeating group, as Joe explains.

For a long time, procedural languages have used arrays to hold and manipulate data, so programmers are used to designing data with them. The relational model states that all data is shown in tables, and we have no other data structures. To be in First Normal Form (1NF) a table can have no repeating groups, which mean that it has only columns of scalar values. You can make a good case that you can only declare a table in the 1NF in SQL. But that does not stop programmers from finding ways to “fake it” and make their SQL look like their familiar application languages.

In procedural languages, arrays are stored in “row major” or “column major” order. Go to Wikipedia (Row-major order) for details, but the important concept is how to take an n-dimensional array and “flatten” it into sequential physical storage. This is important for procedural languages because it determines the algorithms used to access the array elements. In SQL, we do not care about any physical storage.

It is quite common to see repeated groups used in SQL to attempt to simulate an Array. An example of a repeated group is an employee table with a set of columns for dependents, as follows:

This table has many problems. All the columns in the repeating group must be NULL-able, otherwise you cannot represent employees with zero, one, two, three, or four dependent children. The alternative of requiring all employees to have exactly five kids does not work. Likewise, the first employee that has six children messes up the table. And you cannot require him to ditch one of his dependents.

If “number one son” dies, you have to decide to leave his slot NULL or to move the others up one notch. Where would you put the name of a new daughter in the group? Position in a repeating group often has meaning in itself, so you cannot use a single simple UPDATE, INSERT, or DELETE procedure for such tables.

Queries are much more difficult. As an exercise, try to write queries against the Personnel table to find:

  1. All kids named “George” by the same employee. George Foreman should be the only answer to this one.
  2. All employees with three or more offspring
  3. All employees with a kid who has the same name as they do (find the juniors)
  4. Pairs of employees whose children have the same names

You are unlikely to enjoy this task. Using children makes the silliness easy to see.

Representing Arrays in SQL

SQL cannot represent arrays directly, but vendors often provide array language extensions. Two methods for supporting arrays are to have columns with “array” data types (as a whole) or to allow the referencing of groups of columns by subscript (element by element). Subscripts are also called array indexes, but that term can be confused with table indexes in SQL, so I use the term “subscript” instead.

An array in other programming languages has a name and subscripts by which you reference the array elements. Typically, the array elements all have the same data type and the subscripts are all integers. Some languages start numbering at zero, some at one, and others let the user set the upper and lower bounds. A Pascal array declaration, for example, would look like this:

This would have elements MyArray[1], MyArray[2], MyArray[3], MyArray[4], and MyArray[5]. The same structure is often mapped into a SQL declaration as:

You have to go to all this trouble because there is no subscript that you can iterate in a loop. In fact, there is no loop control structure at all! You must use column names.

A better alternative approach to faking an array in the relational model is to map arrays into a table with an integer column for each subscript, as follows:

This looks more complex than the first approach, but it is closer to what the original Pascal, or any other procedural language, declaration does behind the scenes. Subscripts resolve to unique physical addresses, so it is not possible to have two values for MyArray[i]; hence i is a key. The compiler will check to see that the subscripts are within the declared range using the CHECK() clause.

The first advantage of this approach is that you can easily handle multidimensional arrays by adding another column for each subscript. The Pascal declaration:

becomes:

Obviously, GROUP BY clauses on the subscript columns will produce row and column totals. If you used the original one-element/one-column approach, the table declaration would have 120 columns, named “element111” to “element345.” This would be too many names to handle in any reasonable way.

This idiom can support matrix math, but I am not going to go into that in this article. If anyone is interested, try to write the following operations for a classic two dimensional matrix:

  1. Matrix equality
  2. Matrix addition
  3. Matrix multiplications
  4. Matrix sorting
  5. Compute a determinant

Let’s go back the Personnel table and declare it using this approach:

Actually, you should normalize this table further by splitting it into Personnel and Dependents tables. The Dependents table needs its own constraints and references back to the Personnel table. But the way to think about it is that you are doing explicitly what has been done implicitly.

The four proposed queries are now simple and will work for families of any size.

  1. All kids named “George”

  2. All Personnel with three or more offspring

  3. All Personnel with a kid who has the same name as they do ( find the “juniors”). George Foreman will get several lines in the output…

  4. The query “find pairs of employees whose children all have the same names” is very restrictive. Both Mr. X and Mr. Y must have exactly the same number of dependents, and both sets of names must match and be in the same birth_seq. (We can assume that nobody has two children with the same name — except George Foreman.) Begin by constructing a table of sample data:

In this test data, employees 1, 2, and 3 all have kids named Tom, Dick, and Harry. The birth order is the same for the children of employee 1 and 3. For testing purposes, you might consider adding an extra child to the family of employee 3, and so on.

While there are many ways to do this query, this approach gives you some flexibility that others do not. Construct a VIEW that gives you the number of each employee’s dependents:

Create a second VIEW that holds pairs of Personnel who have families of the same size:

You can test for set equality by doing a self-join on the dependents of Personnel with the same size families. If one set can be mapped onto another with no children left over, then the two sets are equal:

If birth order is not important, then drop the predicate “D1.birth_seq = D2.birth_seq” from the query. Obviously, all of the VIEWs could be done with CTEs.

Flattening a Table into an Array

In a report, you often want to see an array distributed horizontally on a line. The one-element-per-column approach to mapping arrays into SQL was based on seeing such reports and duplicating that structure in a table. Yes, you can use non-relational proprietary extensions like PIVOT today. But you need to know the “subscripts and value” approach so that you can read old code.

Imagine, for example, a company that collects monthly sales reports with each sales representative’s name, the month_name, and each rep’s total dollar figure. You want to produce a report with one line for each person and his or her year’s work across the page. The sales reports table looks like Listing 1:

You need to flatten out this table to get the desired rows for the report. First create a working storage table from which the report can be built.

NOTE: DEC is a reserved shorthand word for DECIMAL in Standard SQL, so it has to be double quoted.

Notice that the primary key is the sales rep’s name and that the monthly data columns default to zero dollars. The first step is to get a row for every sales rep in the working table:

Because of the DEFAULT() clause, the other columns will fill with zero amounts. If your SQL does not have a DEFAULT() clause, simply add a dozen constant zeros to the SELECT list.

The data from the Sales table is then added into the working table with a series of UPDATEs of the form:

Or, using newer SQL features:

This basic technique can be modified to handle NULLs, collect totals, and so forth. The trick is in having a column in the flatten table that matches a value from the 1NF table.