CREATE TABLE genome_sequences(
sequence_id BIGINT NOT NULL PRIMARY KEY,
chromosome_number INT NOT NULL,
start_offset BIGINT NOT NULL,
codons CHAR(3) ARRAY[1000] NOT NULL);
Elements of the array may be declared with almost any datatype (in
this example, CHAR(3)). However, in SQL99, nested arrays are illegal,
whether the nesting is direct or indirect; you can't declare
illegal_column1 INT ARRAY[3] ARRAY[5]
nor can you declare
illegal_column2 ROW(INT,DOUBLE ARRAY[7]) ARRAY[4]
(This restriction is removed by SQL2003.)
Array values are created with the ARRAY constructor:
INSERT INTO genome_sequences VALUES(
10032423, 3, 432432,
ARRAY['CUG', 'AAG', 'GGU', 'ACU', 'CUU', 'GGU', 'UGG', 'UAA']);
The length of an array is retrieved with the CARDINALITY function:
SELECT sequence_id,CARDINALITY(codons) as sequence_length
FROM genome_sequences
ORDER BY sequence_length;
And the actual values can be selected with the usual bracket syntax:
SELECT
sequence_id,
codons[1] as first_codon,
codons[CARDINALITY(codons)] as last_codon
FROM genome_sequences
ORDER BY sequence_id;
Notice that array offsets are 1-based. Attempting to read an
out-of-bounds offset (or to write an offset past the declared maximum
cardinality) results in an exception. Any element of an array may be
null, and there is no straightforward means of declaring elements NOT
NULL. Updating an element past the current cardinality automatically
extends the array, setting any intervening elements thus created to null.
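The extension behavior can be sketched as follows, using sequence
10032423 from the earlier INSERT (which currently holds 8 codons; the
codon value here is arbitrary):

```sql
-- Writing to offset 10 extends the array to cardinality 10;
-- the previously nonexistent element at offset 9 becomes null.
UPDATE genome_sequences
SET codons[10] = 'AUG'
WHERE sequence_id = 10032423;
```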
Arrays can be concatenated with the usual notation:
CREATE VIEW spliced_sequences AS
SELECT s1.sequence_id, s1.codons || s2.codons AS spliced_codons
FROM genome_sequences s1, genome_sequences s2
WHERE s1.chromosome_number = s2.chromosome_number
AND s2.start_offset = s1.start_offset + CARDINALITY(s1.codons);
Array elements can be
set individually in the SET clause of an UPDATE statement, or the
entire array can be set as the target of an INSERT or UPDATE:
UPDATE genome_sequences
SET codons =
(SELECT codons
FROM spliced_sequences
WHERE spliced_sequences.sequence_id = genome_sequences.sequence_id);
Array assignment in which the maximum size of the target is less than
the actual size of a source value results in an exception. In the
example above, this will occur if two sequences spliced together make
up a sequence longer than 1000 codons.
Arrays can be compared for exact match with = or <>:
CREATE VIEW matching_sequences AS
SELECT s1.sequence_id as sid1, s2.sequence_id as sid2
FROM genome_sequences s1, genome_sequences s2
WHERE s1.codons = s2.codons;
No other inequality operator is supported, and usage of arrays in just
about any other context (including GROUP BY) is illegal.
The UNNEST operator converts an array into a query expression:
-- convert a particular sequence into a table
SELECT codon,offset
FROM UNNEST((SELECT codons FROM genome_sequences WHERE sequence_id = 10032423))
WITH ORDINALITY
AS codon_table(codon,offset)
ORDER BY offset
The WITH ORDINALITY clause specifies that the array ordering should be
used to derive the offset column. Without this clause, the codons
would be returned without any ordering information.
What if we wanted to convert all of the sequences together?
That's what the new LATERAL derived table feature is for:
SELECT
g.sequence_id,
codon_table.codon,
codon_table.offset as relative_offset,
g.start_offset + codon_table.offset - 1 as absolute_offset
FROM
genome_sequences g,
LATERAL(UNNEST(g.codons) WITH ORDINALITY)
AS codon_table(codon,offset)
WHERE chromosome_number = 7
ORDER BY absolute_offset,sequence_id
This should result in a table containing all of the codons from
chromosome #7. The first column identifies the sequence containing the codon.
The second column is the codon itself. The third column is the
relative offset of the codon within the sequence. The last column is
the absolute offset of the codon within the chromosome. Returned rows are
ordered by absolute position.
Note that the new LATERAL keyword is normally required in order to
reference a table expression defined earlier in the FROM list (g in
this example). However, when UNNEST is specified, LATERAL is implicit.
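So the previous query could, in principle, be written without the
keyword (a sketch, with the select list abbreviated):

```sql
SELECT
g.sequence_id,
codon_table.codon
FROM
genome_sequences g,
UNNEST(g.codons) WITH ORDINALITY
AS codon_table(codon,offset)
WHERE chromosome_number = 7;
```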
An array value can also be constructed from the result of a query,
with the ORDER BY determining the order of the elements:
INSERT INTO listbox_choices VALUES(
'Department Names',ARRAY(SELECT name FROM sales.depts ORDER BY 1));
Unlike arrays, multisets never have a declared maximum cardinality:
CREATE TABLE logins(
session_id INT NOT NULL PRIMARY KEY,
successful BOOLEAN NOT NULL,
uid INT,
attempts ROW(name VARCHAR(128),password VARCHAR(128)) MULTISET);
Multiset constructors are similar to array constructors:
INSERT INTO logins VALUES(
1000,true,0,
MULTISET(
ROW('root','31337'),
ROW('scott','tiger'),
ROW('root','beer')));
INSERT INTO logins VALUES(
1001,false,0,MULTISET(SELECT ROW(name,password) FROM bogus_accounts));
(Note that ORDER BY is not allowed on a multiset constructor query
since multisets are unordered.) As syntactic sugar, another
constructor named TABLE is also provided which automatically converts
a query into a multiset of rows. The next example is equivalent to
the previous query-based MULTISET constructor:
INSERT INTO logins VALUES(
1001,false,0,TABLE(SELECT name,password FROM bogus_accounts));
CARDINALITY returns the total number of elements in a multiset (not the number of distinct elements). Multisets cannot be concatenated (since they aren't ordered); instead, a number of set operators are provided for multisets:
multiset1 MULTISET UNION [ALL|DISTINCT] multiset2
multiset1 MULTISET INTERSECT [ALL|DISTINCT] multiset2
multiset1 MULTISET EXCEPT [ALL|DISTINCT] multiset2
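For example, the attempts from two sessions could be combined with
duplicates removed (a sketch, using the session ids from the earlier
INSERTs):

```sql
SELECT a.attempts MULTISET UNION DISTINCT b.attempts
AS combined_attempts
FROM logins a,logins b
WHERE a.session_id = 1000 AND b.session_id = 1001;
```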
Some multiset-specific predicates are also provided:
-- find sessions with a particular username/password combination
SELECT session_id
FROM logins
WHERE ROW('root','31337') MEMBER OF attempts;
-- find sessions where no combination was retried
SELECT session_id
FROM logins
WHERE attempts IS A SET;
-- find sessions whose attempted combinations subsume those of other sessions
SELECT a.session_id as sub_id,b.session_id as super_id
FROM logins a,logins b
WHERE a.attempts SUBMULTISET OF b.attempts;
UNNEST can be used on a multiset as well (though WITH ORDINALITY is
not allowed, since multisets are unordered).
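For example, all attempted username/password combinations could be
flattened into rows (a sketch; the column names are supplied by the AS
clause):

```sql
SELECT l.session_id,t.name,t.password
FROM logins l,
UNNEST(l.attempts) AS t(name,password);
```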
The multiset-specific ELEMENT operator converts a multiset with
cardinality 1 into a row expression:
SELECT session_id,ELEMENT(attempts).name
FROM logins
WHERE CARDINALITY(attempts) = 1;
Finally, aggregation operators are provided for collecting row values
into multisets, and for aggregating multisets:
SELECT
uid,
COLLECT(session_id) AS session_ids,
FUSION(attempts) AS all_attempts,
INTERSECTION(attempts) AS common_attempts
FROM logins
WHERE successful
GROUP BY uid;
For each user, the query above returns a row with the uid key and three
multiset columns. The session_ids column contains a multiset of the
sessions which successfully logged in that uid. The all_attempts
column contains a multiset of all of the username/password
combinations used for that uid. The common_attempts column contains a
multiset of any username/password combinations which were tried in
every login of that uid.
Instead, a more straightforward approach is to treat collections as opaque variable-length byte arrays in parts of the system which don't need to understand them. This implies that collection size will be bounded by implementation details such as page size. (Eventually, once Farrago has LOB support, it should be used to remove size limits on collections.)
A natural internal structure for these byte arrays is the same format we use for passing tuple data between XO's: a contiguous sequence of marshalled tuples. (For collections of non-row type, these would be 1-tuples; for collections of row type, we need an indicator for the entire row being null--maybe a hidden boolean attribute.) A fixed-size header containing information such as cardinality might also be necessary.
Given this representation, it's also natural to implement collection operations via standard XO's. For example, multiset UNION can be implemented in the same way as a normal query UNION. The collection byte array can be referenced directly as input to these XO's. To accomplish this, several components must collaborate:
Based on the above approach, here are the components affected and estimates for the amount of work needed to implement multisets (but not arrays, which can be tackled as a follow-on project):