SQL Standard Collection Types

This document describes standard support for collection types in SQL99 and SQL2003. The specifications are informal; for details, see the relevant portions of the standards. None of this is implemented in Farrago yet.

SQL99 Array Type

SQL99 introduces limited support for a single collection type known as arrays. Arrays are variable-sized ordered collections with a declared maximum cardinality. Here's an example of how to declare a column with array type:


CREATE TABLE genome_sequences(
    sequence_id BIGINT NOT NULL PRIMARY KEY,
    chromosome_number INT NOT NULL,
    start_offset BIGINT NOT NULL,
    codons CHAR(3) ARRAY[1000] NOT NULL);

Elements of the array may be declared with almost any datatype (in this example, CHAR(3)). However, in SQL99, nested arrays are illegal (whether the nesting is direct or indirect); you can't declare


    illegal_column1 INT ARRAY[3] ARRAY[5]

nor can you declare


    illegal_column2 ROW(INT,DOUBLE ARRAY[7]) ARRAY[4]

(This restriction is removed by SQL2003.)

Array values are created with the ARRAY constructor:


INSERT INTO genome_sequences VALUES(
    10032423, 3, 432432, 
    ARRAY['CUG', 'AAG', 'GGU', 'ACU', 'CUU', 'GGU', 'UGG', 'UAA']);

The length of an array is retrieved with the CARDINALITY function:


SELECT sequence_id,CARDINALITY(codons) as sequence_length
FROM genome_sequences
ORDER BY sequence_length;

And the actual values can be selected with the usual bracket syntax:


SELECT 
    sequence_id,
    codons[1] as first_codon,
    codons[CARDINALITY(codons)] as last_codon
FROM genome_sequences
ORDER BY sequence_id;

Notice that array offsets are 1-based. Attempting to read an out-of-bound offset (or to write an offset past the maximum cardinality) results in an exception. Any element of an array may be null, and there is no straightforward means of declaring them NOT NULL. Updating an element past the current cardinality automatically extends the array and sets any new elements thus created to null.

Arrays can be concatenated with the usual notation:


CREATE VIEW spliced_sequences AS
SELECT s1.sequence_id, s1.codons || s2.codons AS spliced_codons
FROM genome_sequences s1, genome_sequences s2
WHERE s1.chromosome_number = s2.chromosome_number
AND s2.start_offset = s1.start_offset + CARDINALITY(s1.codons);

Array elements can be set individually in the SET clause of an UPDATE statement, or the entire array can be set as the target of an INSERT or UPDATE:


UPDATE genome_sequences
SET codons = 
    (SELECT codons 
     FROM spliced_sequences 
     WHERE spliced_sequences.sequence_id = genome_sequences.sequence_id);

Array assignment in which the maximum size of the target is less than the actual size of a source value results in an exception. In the example above, this will occur if two sequences spliced together make up a sequence longer than 1000 codons.

Arrays can be compared for exact match with = or <>:


CREATE VIEW matching_sequences AS
SELECT s1.sequence_id as sid1, s2.sequence_id as sid2
FROM genome_sequences s1, genome_sequences s2
WHERE s1.codons = s2.codons;

No other inequality operator is supported, and usage of arrays in just about any other context (including GROUP BY) is illegal. The UNNEST operator converts an array into a query expression:


-- convert a particular sequence into a table
SELECT codon,offset
FROM UNNEST((SELECT codons FROM genome_sequences WHERE sequence_id = 10032423))
     WITH ORDINALITY 
     AS codon_table(codon,offset)
ORDER BY offset

The WITH ORDINALITY clause specifies that the array ordering should be used to derive the offset column. Without this clause, the codons would be returned without any ordering information. What if we wanted to convert all of the sequences together? That's what the new LATERAL derived table feature is for:


SELECT 
    g.sequence_id,
    codon_table.codon,
    codon_table.offset as relative_offset,
    g.start_offset + codon_table.offset - 1 as absolute_offset
FROM 
    genome_sequences g,
    LATERAL(UNNEST(g.codons) WITH ORDINALITY)
    AS codon_table(codon,offset)
WHERE chromosome_number = 7
ORDER BY absolute_offset,sequence_id

This should result in a table containing all of the codons from chromosome #7. The first column is the sequence containing the codon. The second column is the codon itself. The third column is the relative offset of the codon within the sequence. The last column is the absolute offset of the codon within the chromosome. Returned rows are ordered by absolute position.

Note that the new LATERAL keyword is normally required in order to reference a table expression defined earlier in the FROM list (g in this example). However, when UNNEST is specified, LATERAL is implicit.

SQL2003 Array Type

SQL2003 extends array semantics in a number of ways. Arrays can be declared without a maximum cardinality, in which case the maximum is vendor-defined (possibly unbounded like a LOB). Array nesting is unrestricted. And a new constructor is added for converting a query into an array:


INSERT INTO listbox_choices VALUES(
    'Department Names',ARRAY(SELECT name FROM sales.depts ORDER BY 1));

SQL2003 Multiset Type

SQL2003 also introduces another collection type known as a multiset. A multiset is much like an array, but unordered, and has more useful operators. Although multisets and arrays are both collections, they don't mix (e.g. you can't INSERT a multiset into an array column).

Unlike arrays, multisets never have a declared maximum cardinality:


CREATE TABLE logins(
    session_id INT NOT NULL PRIMARY KEY,
    successful BOOLEAN NOT NULL,
    uid INT,
    attempts ROW(VARCHAR(128),VARCHAR(128)) MULTISET);

Multiset constructors are similar to array constructors:



INSERT INTO logins VALUES(
    1000,true,0,
    MULTISET(
        ROW('root','31337'), 
        ROW('scott','tiger'), 
        ROW('root','beer')));

INSERT INTO logins VALUES(
    1001,false,0,MULTISET(SELECT ROW(name,password) FROM bogus_accounts));

(Note that ORDER BY is not allowed on a multiset constructor query since multisets are unordered.) As syntactic sugar, another constructor named TABLE is also provided which automatically converts a query into a multiset of rows. The next example is equivalent to the previous query-based MULTISET constructor:


INSERT INTO logins VALUES(
    1001,false,0,TABLE(SELECT name,password FROM bogus_accounts));

CARDINALITY returns the total number of elements in a multiset (not the number of distinct elements). Multisets cannot be concatenated (since they aren't ordered); instead, a number of set operators are provided for multisets:



multiset1 MULTISET UNION [ALL|DISTINCT] multiset2

multiset1 MULTISET INTERSECT [ALL|DISTINCT] multiset2

multiset1 MULTISET EXCEPT [ALL|DISTINCT] multiset2

Some multiset-specific predicates are also provided:



-- find sessions with a particular username/password combination
SELECT session_id
FROM logins
WHERE ROW('root','31337') MEMBER OF attempts;

-- find sessions where no combination was retried
SELECT session_id
FROM logins
WHERE attempts IS A SET;

-- find sessions whose attempted combinations subsume those of other sessions
SELECT a.session_id as sub_id,b.session_id as super_id
FROM logins a,logins b
WHERE a.attempts SUBMULTISET OF b.attempts;

UNNEST can be used on a multiset (but not WITH ORDINALITY, obviously). The multiset-specific ELEMENT operator converts a multiset with cardinality 1 into a row expression:


SELECT session_id,ELEMENT(attempts).name
FROM logins
WHERE CARDINALITY(attempts) = 1;

Finally, aggregation operators are provided for collecting row values into multisets, and for aggregating multisets:


SELECT 
    uid,
    COLLECT(session_id) AS session_ids,
    FUSION(attempts) AS all_attempts,
    INTERSECTION(attempts) AS common_attempts
FROM logins
WHERE successful
GROUP BY uid;

For each user, the query above returns a row with the uid key and three multiset columns. The session_ids column contains a multiset of the sessions which successfully logged in that uid. The all_attempts column contains a multiset of all of the username/password combinations used for that uid. The common_attempts column contains a multiset of any username/password combinations which were tried for all logins of that uid.

Implementation

Many existing system components will be involved in the implementation of collection types. It might be possible to use a rewrite approach to keep collection types out of the optimizer and executor via hidden joins and locators, but this would make it very difficult to preserve good locality of reference for small-to-medium sized collections.

Instead, a more straightforward approach is to treat collections as opaque variable-length byte arrays in parts of the system which don't need to understand them. This implies that collection size will be bounded by implementation details such as page size. (Eventually, once Farrago has LOB support, it should be used to remove size limits on collections.)

A natural internal structure for these byte arrays is the same format we use for passing tuple data between XO's: a contiguous sequence of marshalled tuples. (For collections of non-row type, these would be 1-tuples; for collections of row type, we need an indicator for the entire row being null--maybe a hidden boolean attribute.) A fixed-size header containing information such as cardinality might also be necessary.

Given this representation, it's also natural to implement collection operations via standard XO's. For example, multiset UNION can be implemented in the same way as a normal query UNION. The collection byte array can be referenced directly as input to these XO's. To accomplish this, several components must collaborate:

The optimizer must identify subtrees of row expressions where all the nodes are collection operations, transform these into corresponding relational expressions, and generate a subplan implementation for them. The ExecutionStreamDefs thus generated will be disconnected from the rest of the query tree. (Note that the optimizer already supports disconnected graphs, and the process of identifying collection expression subtrees is similar to some of the existing FarragoAutoCalcRule code.)
A new calculator instruction is required for invoking a subplan. The instruction is responsible for accessing the byte-array registers corresponding to collection-valued inputs, binding them to the input XO's in the subplan, executing the subplan, and then copying the subplan's output into a byte-array result register. (Alternatively, a specialized XO could be used instead of the calculator.)
The optimizer must work together with the calculator code generator to bind the subplan to the generated instruction.

In fact, much of this is similar to what's needed in order to execute correlated subqueries when they are not rewritten as joins, so this work is useful for more than just collection type implementation.

Based on the above approach, here are the components affected and estimates for the amount of work needed to implement multisets (but not arrays, which can be tackled as a follow-on project):

Parser (including SqlNode/SqlOperator representation). [4 days] This covers:
- multiset constructor from values
- multiset constructor from query
- multiset table constructor from query
- multiset UNION/INTERSECT/EXCEPT
- ELEMENT operator
- MEMBER OF operator
- IS A SET operator
- SUBMULTISET OF operator
- COLLECT operator
- FUSION operator
- INTERSECTION operator
- UNNEST/LATERAL operator
- CARDINALITY function
- multiset type declaration in DDL parser
Validator: validation for operators, functions, and DDL (see parser list). [5 days]
Sql2RelConverter: convert LATERAL query WHERE clause to extra join condition, UNNEST to UnnestRel, and various other operators (see parser list). Initial implementation will be very restrictive in terms of what is supported inside of a LATERAL query. [4 days]
Type system: add collection types. [2 days]
Optimizer: UnnestRel representation; generate subplans for multiset operators. [7 days]
Executor: XO's for converting from TupleStream to single multiset value and vice versa. REVIEW: anything special needed in aggregate XO's for implementing COLLECT/FUSION/INTERSECTION? [3 days]
Calculator: new instruction for executing subplan (includes code generation work). [6 days]
JDBC driver: java.sql.Array implementation (have to use this for multisets since nothing else exists). This requires extra support for executing a special kind of subplan to produce the ResultSet representation of the collection. This feature is optional (UNNEST is much more powerful and can be used to accomplish most of the same functionaliy.) [8 days]

Notes

For contrast, see the PostgreSQL docs, which covers both standard and non-standard syntax and semantics.
See also the official SQL:2003 announcement.
This document does not cover non-SQL language bindings. For JDBC, see the Javadoc.
The examples in this document should not be taken as recommendations for good schema design. (Many critics have suggested that the addition of collection types to SQL is a step backwards.)