ItemSpace Data Structures

Copyright © 2002 Roger L. Deran and Boiler Bay Inc..

This document describes some of the rich variety of data structures that can easily be superimposed on a very simple lower-level data model called an ItemSpace. The superimposed data structures can take the form of sets, extensible relations, relational indexes or inversions, texts, ‘binary long objects’ or ‘BLOB’s or ‘character  long objects’ or ‘CLOBS’s, or an ‘Entity-Attribute-Value’ semantic net, among others. The lower-level ItemSpace is simply an ordered set of Items, each Item being a variable-length sequence of chars.  A persistent ItemSpace implementation typically uses a B-Tree, but ItemSpaces can be implemented in a wide variety of ways.

The ItemSpace Model

In this document we will describe only the Java-language ItemSpace implementation, although in principle C C#, or C++ would work fine. One implementation was in 8086 assembly language.

An ItemSpace is a value-ordered set of Items, each Item being a variable-length sequence of char’s. For ordering purposes, the Items are compared character-by-character like String’s. However, an Item is not a String, nor is it used in the same way, as the char’s in an Item will often contain binary information as well as printable characters. An ItemSpace has no state other than the Item sequence, and is concurrently usable and Thread-safe. An ItemSpace supports accessor methods like next() and previous(), which retrieve the Item nearest a given Item in the ItemSpace, and it optionally supports the modifier methods insert() and delete(), which operate on a given Item. Further description of the ItemSpace model will continue in the ‘Items and Cursors’ section. Because an ItemSpace is so simple, ItemSpace implementations can be very space and time efficient.

About the terms ‘Item’ and ‘ItemSpace’

The ItemSpace data model is normally implemented using a special B-Tree that stores no data attached to each ‘key. ‘ The term ‘Item’ was invented  because it is only the existence or non-existence of Items that carries information in an ItemSpace. Using the terms ‘key’ and ‘KeySpace’ would be misleading: a key should unlock something and yield the thing that gets unlocked. Furthermore, the term ‘key’ is still needed, but in a higher-level sense, when describing the Extensible Relation and Dynamic Index data structures, which are explained in the sections below. Extensible Relations and Dynamic Indexes are built out of Items, but parts of those Items can act as primary and secondary keys for the relations. The keys used in that sense do ‘unlock’ something: the row data of the relations.

Components Inside an Item

An Item is normally composed of a sequence of encoded primitive values such as longs, floats, Strings, Dates, and so on, called 'components'. Each component is stored as a variable-length sequence of chars with an initial type identifier char. Primitive values may be appended quickly to existing Items as components, and any Item can be quickly parsed back into its components to get the original primitive values. The ItemSpace considers Items as char arrays without any knowledge of the components inside, hence the component structuring of Items is optional.

Some ItemSpace Data Structures in Brief

Let’s explore the ItemSpace data structures that have been used in practical applications as well as some further possibilities. These data structures may well seem obvious once described, but the fact is that they have rarely been seen in practice, perhaps because they are dependent on at least four basic things: a) the variable-lengths of Items in the ItemSpace, which allows multiple uses of a single ItemSpace as well as high storage efficiency; b) the efficient storage and searching of Items that have common prefixes, since many ItemSpace data structures depend on repeating long prefixes many times; c) a set of encodings for primitive values in an Item that allow Items to sort correctly, unlike the sorting of int’s converted to Strings for example, and d) a sufficiently fast in-cache search capability, since repetitive local access is sometimes needed. Fast in-cache searching has become much easier, as CPU speeds generally increase on a steeper curve than disk speed.

Below is a list of some of the basic ItemSpace data structures, with brief notes on the advantages of each. These structures are discussed fully below.

ItemSpace Data Structures

·         Extensible Relations:

o        Null  values take no space because no Item is stored with that value

o        Column cardinality  (single- or multi-valued) and size are runtime alterable and extensible

o        Unused. deleted, or future columns take no space, with no limit on number or size

o        Unused, deleted, or future rows take no space, with no limit on number or size

o        Unused, deleted, or future relations take no space, with no limit on number or size

o        Come into existence at the moment of first use: no schema changes needed in advance.

o        One less disk access is required than indexes using tuple-id’s, even if clustered

·         Dynamic Indexes

o        Inversions like relational indexes may be created, deleted, and maintained on-the-fly

o        Unused, deleted, or future indexes take no space, with no limit on number or size

o        Come into existence at the moment of first use: no schema changes needed in advance.

·         Composite Keys and Column Values

o        Are simple concatenations of self-delimiting encoded primitives, Dates, limited-length Strings, and limited-length byte and char arrays.

o        Each pritive encoding has an initial type id char., then data char’s.

o        A variable number of primitives can be concatenated for each composite occurrence, so both <int,int> and <int,int,int> can be used together and compared to each other for ordering.

o        Primitive types at each position are variable, so <String,int> composites can be mixed with  <String,String > composites for example.

·         Sets and Multi-valued Columns

o        Simply groups of Items having any common prefix, such as a <databaseId,set id> or a      <databaseId,relationId,primaryKey,ColumnId > composition

o        Unused, deleted, or future sets take no space, with no limit on number or size.

o        Come into existence at the moment of first use: no schema changes needed in advance.

·         Entity-Attribute-Value Model

o        A generalization of the relational model. The ‘EAV’  model is cleaner, more intuitive, and extensible.

o        Uses <EntityClassId,Entity,AttributeId,Value> quads.

·         CharacterLongObjects and BinaryLongObjects (CLOB’s and BLOB’s)

o        Look like embedded files

o        Blocks are numbered

o        Efficient serial (and in the future, random) access

o        Unused, deleted, empty or future CLOB’s, BLOB’s take no space, with  no limit on number or size

o        Come into existence at the moment of first use: no schema changes needed in advance.

·         Easily Editable Text

o        Floating-point line-numbered lines are directly insertable and deleteable.

o        Efficient access randomly or serially

o        Unused, deleted, empty or future texts or text segments  take no space, with  no limit on number or size

o        Come into existence at the moment of first use: no schema changes needed in advance.

·         Trees

o        Rely on variable-length composite keys to construct paths.

o        A sequence of String components is like a directory structure.

·         Future Designs

o        Packed Items

o        Item chaining

o        Persistent Objects.

The above data structure descriptions repeat the phrase ‘with no limit on number or size’ very often, and also ‘unused. deleted, or future . . . take no space’, as well as ‘no schema changes needed in advance.’ These are common features of ItemSpace data structures.

Items and Cursors

An Item existing outside the ItemSpace normally lives in a ‘Cursor’. The ItemSpace’s insert(), delete(),  next(), and previous() methods among others accept Cursors containing Items as parameters. The actual class implementing the Cursor concept is called simply ‘Cu’, which is conveniently short since it occurs so often in the code.. We will refer to ‘Cursors’ in the text. A Cursor has no state other than the Item it contains. A Cursor is used by one Thread at a time.

 A Cursor may seem very similar to a Java StringBuffer, in that it supports an overloaded append() method on primitive values for constructing sequences of char’s. However, appending an int to a Cursor creates a binary-encoded sequence rather than a sequence of printable chars. Each value appended to a Cursor creates a self-delimiting subsequence of the Item’s characters called a component, and it can be compared to other components in the same location in other Cursors in such a way that the ordering is appropriate to the types of the components.  In other words, if two int’s are compared directly, the result will not always be the same as if they are appended to two StringBuffers that are then converted into Strings and compared. On the other hand, when two int’s are appended to two empty Cursors or to two Cursors containing the same Items, then the Cursors are compared, the results agree with the comparison of the int’s themselves.  This comparison-consistency comes from the type-specific encoding of the data into the char’s of a component, as described in the next section. Another difference with StringBuffer is the way toString() works: Cu’s produce something useful primarily for debugging purposes.

Storing and Retrieving Data

Here is how to store and retrieve an unknown value or ‘data’ given some ‘key’ or identifying value.  The key and data can each be any sequence of components. Append the key and value components to a Cursor, and invoke insert() on the Cursor to store them into the ItemSpace. To retrieve the value, place the key into a Cursor and invoke next() on the Cursor. The data, if any, is now at the end of the Cursor after the key. This is a slight oversimplification, because there is an additional parameter to next() called the ‘protected prefix length’ or usually just ‘pl’, which identifies the number of characters in the Cursor that are to be kept the same after next() is invoked.  The full signatures of these most important methods of the ItemSpace class are:

public void insert(Cu cu) throws IOException;

public void delete(Cu cu) throws IOException;

public boolean next(Cu cu,int pl) throws IOException;

public boolean previous(Cu cu,int pl) throws IOException;

 Rather than modifying the protected prefix, next() returns false and the Cursor’s value is preserved.  Therefore, the boolean return value indicates the existence or lack of a value for the given key. To get the proper value of ‘pl’, normally the original length of the Cursor is used. In a while loop, for retrieving multiple data values for the same key, the pl is the original length of the Cursor, and the return value from next() breaks the loop when false. Examples of these usages are given below.

Primitive Data Encodings

Each encoded component begins with an internal initial code char or ‘idChar’ which identifies both the type of the component and sometimes part or all of the value. It is not necessary for an application to use these idChar’s directly. The Cu.append() method is overloaded to handle long, float, double,  boolean, Date, String, char[], and byte[] types. The two array types are for short sequences of chars or bytes: unlimited length sequences use the CLOB’s or BLOB’s and multiple Items as described below. The actual encodings of the components are in principle not important to an application programmer. Nevertheless, here are some descriptions of the component encodings for the primitive types.

o        boolean is stored as one of two idChar’s, with no data chars following..

o        long is appended with an idChar followed by a four-char big-endian binary value with most-significant bit reversed.  The most-significant bit of a long is reversed so that it sorts correctly. Longs near zero are stored in one total char by using a range of idChar’s instead of one idChar as in most of the other primitive encodings.

o        float and double are stored as one idChar followed by a big-endian binary sequence of two or four chars with the all the non-sign bits reversed if the number is negative, and with the sign bit reversed, to get the correct sorting behavior.

o        Date is stored somewhat like a long, as a number of milliseconds since the epoch, but it has a different idChar and does not use a range of idChars for values near zero.

o        String appended to a Cursor contains an initial idChar and a zero char at the end to delimit it, and binary escape sequences embedded in it to allow any character to occur inside. The escapes that are used for Strings are:  0 becomes 1 followed by 2, and 1 becomes 1 followed by 3.

o        char[] is stored as an idChar followed by a char length and the chars in the array starting at the lowest-indexed char.Relatively short arrays can be handled.

o        byte[] is stored as an idChar followed by a char length and the bytes in the array starting at the lowest-indexed char. Relatively short arrays can be handled.

These encodings are not the only possible ones but have been found good in practice. For backward compatability reasons, these encodings will not be changed without a very compelling reason, but they could be extended in the future.  If the component cannot be appended because it would make the Cursor too long for the Cursor’s implementation, a CursorLengthException is thrown, and the actual length does not change.

As a practial matter, these encodings can be very efficient when used with an ItemSpace that further subjects Items to UTF-8 and then ZLib compression before being stored on disk, yielding a one- or two- followed by a three-fold compaction respectively. Because long’s can be stored efficiently in this way even when there are many leading zeroes, and especially efficiently for the values near zero, Cu.append() for byte, short, and int were omitted. Allowing these other integral types is dangerous for two reasons. One is that it is easy to get the wrong type when using overloading – for example, a short can be changed to an int or an int to a long during normal arithmetic promotion or during normal program maintenance. The other is that it is difficult to predict how large a particular number may need to be when designing the schema for a database – it is much better to be conservative, but the tendency is to be stingy when it is perceived that space will be wasted. Using a long everywhere may seem extravagent, but there is really no significant space or time overhead given good data compression. Even without any UTF-8 or ZLib compression, simple common-prefix compression will remove the idChar and some of the subsequent ‘data’ chars in common cases.

When a component is to be extracted from a Cursor, the offset of the component is needed, and possibly the type. The actual type of the component can be retrieved using Cu.typeAt(int offset), which returnes a distinct int for each of the possible component types. The types are static final int’s such as Cu.STRING_TYPE. These types are int’s and are not the same as the idChar’s at the beginning of the encoded components. There are also methods like ‘String stringAt(int offset)’ for each of the types that append() accepts.  Using stringAt() with the offset of a component that is not a String will result in a CursorDataTypeException. There is thus a form of runtime type safety.

The offset of a component can be determined from the preceding component’s offset using a method like ‘int skipString(int offset)’ or ‘int skipComponent(int offset)’ if the component type is not known. Most often, the offset of a component will be known from the context, and the entire Item need not be parsed from the beginning to find it.  The offset of the component will frequently be the same as the prefix length ‘pl’ that was used in a preceding next() invokation. If the offset is too large, CursorIndexOutOfBoundsException is thrown. In principle it is possible for an erroneous offset to point into the middle of a component, but this happens very rarely in practice. A bad offset will normally cause either a CursorDataTypeException or CursorIndexOutOfBoundsException to be thrown.

There is a way to bypass the strict typing of the components in a Cursor, say, for debugging purposes, to understand the component encodings, to create extremely efficient storage formats, or to work with ‘PlainText’ Items which are unencoded strings. Cu.charAt(int) will return the char at any offset in a Cu. Cu.append(char) appends exactly one char without any type identifier char preceding it. Cu setCharAt(int off,char c) will change one char at any position. There are also several ‘PlainText’ related methods in Cu which each have ‘PlainText’ in the method name and which can be used for working with raw character data in a Cu. Working with such raw data will not be discussed further here.

It is possible in many situations to use a Cursor without dealing with the types of the components inside. One can often stay within the Cursor-and-ItemSpace world entirely, avoiding, for example, the construction of Strings by stringAt(int offset), which can be a performance limit. There are methods for appending one Cursor onto another and for extracting a suffix from one Cursor and appending it onto another. Also, generic programming can use Object Cu.componentAt(int offset), int Cu.skipComponentAt(int offset), and Cu.append(Object). For such generic programming, primitives are represented by their corresponding wrappers such as Long.

The Item Length Limitation

Every ItemSpace and Cursor has an upper limit on the size of an Item which it can store, but it is normally much greater than that needed for any practical purpose. For example, the Infinity Database Engine has a 1666 character limit. Almost all other databases have shorter key length limits. This limit is fine given the following reasoning. Items can loosely be broken down into a primitive or short composite ‘name or identifier’ part and a primitive or short composite ‘data’ part, and the combination of the two has a natural tendency to be relatively small. Typical names and identifiers of real-world things range from, say, 1 to 100 characters, and the data part will normally be relatively small - typically a primitive, a small number of primitives in a composite, a segment of a CLOB or BLOB, or a text line, for example.

 When long Items are expected to arise, a proper data structure design can always avoid the problem. Long pieces of data can always be broken up into multiple Items in some way and stored in an ItemSpace, sometimes even providing a higher degree of editability, extensibility, and flexibility as a result. As an example, a long text can be broken at line boundaries, and each line can be stored with a line number prefix, with long lines being similarly broken down into numbered fixed-length segments. Also,  long sequences of chars or bytes can be broken into blocks and numbered  (see the Easily Editable Text  and CLOB’s, and BLOB’s section below.) In real-world use, the Item-length limitation is not a problem. Furthermore, real-world applications require predictable memory use for every operation, so a very long String  that comes entirely into memory for each access (this actually happens in some DBMS’s) makes application stability dependent on the data. High-reliability programming requires that instantaneous maximum memory use be limited to avoid OutOfMemoryErrors, so data must be handled in some kind of finite-length chunks or as a Stream.

Components often have some kind of externally defined size limitiation. Although these days we are being far more liberal in our size limitations, the limitations nevertheless exist. For example, file paths are limited in virtually all modern operating systems to the order of a hundred chars. A length budget can guarantee that total Item length will not be exceeded when the external limits are known. If there are no external limits, CLOB’s or the like can always be used.

Extensible Relations

An Extensible Relation is constructed of nothing more than a set of Items having a common prefix that identifies the database instance and relation, concatenated with a possible composite primary key, then a column identifier, and finally a possibly composite value:

 

 

 


This storage organization would be very inefficient except that an ItemSpace engine can do common-prefix compression to avoid repeating the database Id, relation Id and primary keys or even column Id’s in adjacent Items. Common-prefix compression operates at the raw Item level and is unaware of the component boundaries, so if adjacent Items share any char’s at the beginning, the shared char’s  will not be stored. An example of two stored relation rows might look like the table below, with one Item per table row. The cells to the left are blank when there is a common prefix. The fixed-length of each table row below does not indicate the corresponding Items are the same length. Each relation row comes out vertical in this table.  The fact that rows of the relation come out vertically in the table below could be called a ‘meta twist.’

 

Database Id

Relation Id

Primary Key

Column Id

Column Value

Chem

Students

Jenkins, Waldo

Student Id

3451

 

 

 

Class

Inorganic 101

 

 

 

 

Proteins 201

 

 

…son, Paul

Student Id

2665

 

 

 

Class

Organic 402

 

 

 

 

Lipid Synthesis 202

 

Note that the ‘..son, Paul’ starts with a ‘..’ which is intended to indicate that the engine has compressed away those characters. Also note that the ‘Class’ attribute can be multi-valued.

Relation Extensibility

With the given structure, there is no limit to the number of databases, the number of relations in a database, the number of rows in a relation, the number of columns, or the number of values in a table cell that can be stored in an ItemSpace. Each component of the Item has a (virtually) unlimited space of combinations. A database, row, or cell that is empty simply has no corresponding Items, and takes no space.

No anticipation of future extensions of an extensible relation is needed in most cases. For example, new columns can be added to a table of an existing database at runtime without requiring the equivalent of a SQL ‘ALTER TABLE table ADD COLUMN column. Sometimes that SQL statement is not even available, and a database dump/drop/create/restore is needed. The null values for the new column do not need to be stored in the database because they are effictively already there since they are represented by the absence of a corresponding Item. Sparse relations are very efficient because of this.

 Data can also be stored into a new relation without creating the new relation in advance. There is no need for the equivalent of an SQL ‘CREATE TABLE’ statement, and any existing database is immediately ready to store data into any new relation, since an empty relation is represented as the absence of any Items containing its id. A newly extended database is forwards compatible as well as backwards, since a newer database version can be used with an older application that simply ignores the new relations or columns.

Adding Data

Adding cell values to the database is simple. A Cursor is allocated, components are appended to it, and it is passed to insert(). The temporary Cursor may then be returned to the pool for quick re-use:

Cu cu = Cu.alloc(); // Get a temporary Cu from the pool. ~1666 chars.
// append() returns the Cu again, so append() can be chained.
cu.append(databaseId).append(relationName)
      .append(primaryKey).append(columnId).append(cellValue);
db.insert(cu);      // ‘db’ is some pre-existing ItemSpace
Cu.dispose(cu);     // Optionally return Cu to pool for maximum speed.

If more Items are to be inserted, the above code can be repeated for each Item, but the alloc() and dispose()can be shared, surrounding the whole sequence,  with the second and subsequent ‘cu.append(..’ lines changed to ‘cu.clear().append(..’.

Accessing Data

The next() method can be used with various protected prefix lengths and a truncation technique to enumerate, among other things: all the databases, all the relations in a database, all the keys in a relation, all the records in a relation, all the columns in a record, or all the cell values in a multi-valued column.

 Let’s examine finding a cell value for a given row and column of the database. First, an Item is constructed having the databaseId, relationName, primaryKey for the desired row, and the columnId. Then ‘boolean next(Cu cu,int pl)’ is applied to that Item, with the protected prefix length pl set to the initial length of the entire Item. The result will be a true or false from next(), and a possible change in the Item in the Cursor. If a true is returned, the end of the Cursor contains the column value, otherwise, the column value is null (not the Java null but the database concept of null, i.e. a missing value.)

Here is the code for retrieving a column value:

Cu cu = Cu.alloc().append(databaseId)
.append(relationName).append(primaryKey).append(columnId);
int prefixLength = cu.length();
if (db.next(cu,prefixLength)) {  // Item exists, column is not empty
      System.out.println(“value=” + cu.componentAt(prefixLength));
}
Cu.dispose(cu);

The actual value extraction is simply a cu.componentAt(prefixLength), which gets a wrapped primitive from the offset prefixLength in the Cursor cu. If the component type were known to be a long, for example, long longAt(int off) could have been used and a wrapper construction avoided. The Cursor cu is allocated from the temporary pool in this example, although a preexisting Cursor could have been reused after a cu.setLength(0) or cu.clear(). The Cu.dispose() is optional but recommended for speed, since garbage collecting a 1666 char array can be slow. A Cursor must be used by only one Thread at a time.

If the column had been of a multi-valued type (which is illegal in conventional RDBM’s) then the ‘if’ would be replaced by a ‘while’ – that is all. As will be described below, multi-value columns can be very useful.

Enumerating the Rows

Now let’s examine a less common operation: enumerating the set of rows in a table. This works, with a few modifications, for enumerating a subsequence of rows as well. First, an Item is constructed containing only the database Id and table Id. In a loop, it is moved along using next() until false is returned. On each iteration, an Item will be returned that contains, after the database Id and table Id, not only the key, but a column Id and a value. These extra parts – the column Id and value - are simply truncated away, and the key is ‘incremented’ in order to prepare it for the next iteration. The incrementing adds one to the last char of the Cursor, and if this brings the char around to zero again, the previous char is incremented and so on. Without this incrementing, the Cursor would not advance because the next()would only retrieve the same key, column Id and value. The Cursor would only be extended rather than moved forward. Here is the code:

Cu cu = Cu.alloc().append(databaseId).append(tableId);
int prefixLength = cu.length();
while (db.next(cu,prefixLength)) {
      String key = cu.stringAt(prefixLength); // Or componentAt()
      System.out.println(“key=” + key);
      int offsetAfterKey = cu.skipString(prefixLength);
      cu.setLength(offsetAfterKey); // Truncate
      cu.incrementSuffix(prefixLen); // Avoids same key
}
Cu.dispose(cu);

In the above code, setting the Cursor’s length to the offset after the key truncates the Cursor before the columnId and value. The application can also use skipComponent(int offset) to skip over a String or long or other type component without knowing or being dependent on the component’s type. Also, skipComposite(int offset) can move over several concatenated components at once, allowing dynamic composite keys and values, by internally repeating skipComponent(int offset) until a special delimiter component is found which is not one of the primitive encodings, as described below in the Entity-Attribute-Value data structure.

Dynamic Indexes

Often an extensible relation will need inversions on some of its secondary fields. To do this, Items of the following form can be used to create an index:

 

 

 


The set of index Id values here must be disjoint from the set of relation Id’s. Accessing by secondary key requires code like this, for a string-valued primary key:

Cu cu = Cu.alloc().append(databaseId).append(indexId)
      .append(secondaryKey);
int prefixLen = cu.length();
while (db.next(cu,prefixLength)) {
      System.out.println(“primary key =” + cu.stringAt(prefixLength));
}
Cu.dispose(cu);

This is almost identical to the code for accessing a column value, except that there is a ‘while’ loop instead of an ‘if‘ statement since there may be multiple primary keys that match a given secondary key.

Note that in the data structures built so far, inversions are maintained by the application program and are not automatic as they are in relational systems. The maintenance of the inversions in existing applications can be easily handled instead by utility code that factors out the work. It is often argued that the existence of an inversion on a particular column of a relation is something that should be determined by the DBA at deployment time or later in order to maximize performance. In actual practice, applications make assumptions about which columns are indexed in order to provide acceptable performance. It is not a serious limitation that the application hard-wires the indexing structure.  When the DBA makes a mistake by removing an index, an application can be catastrophically slow; when an extra index is added, it is a waste of space and cycles.

But even a hard-wired index structure can be changed by future versions of the application program, or by special application features. New indexes can be created and maintained invisibly, and old indexes can be abandoned or deleted, all without obsoleting the file format. Indexes that are intended only as temporary sorts can be created on-the-fly and deleted when no longer needed, without DBA intervention.  ItemSpaces do not provide, at a low-level, the ad-hoc definition of orderings like the ‘ORDER BY’ clause of SQL, but there is no restriction on application code that prevents something similar being implemented at a higher level.

In practice, real relations are often indexed on all meaningful columns so that ‘ORDER BY’ will be reasonably efficient because dynamic sorts are minimized. The same thing is done in typical ItemSpace-based applications as well, so dynamic sorts can often be avoided. Many applications cannot tolerate the delays or unpredictable storage space requirements of dynamic sorting anyway, and will require a fixed set of inversions.

Composite Keys

Is very often the case that a primary or secondary key needs to be composed of more than one primitive value. Using the formats for data primitives discussed so far, this requires code that is aware of the composition. The Items might be formatted as follows:

 

 

 


To add a column value into an extensible relation that has a composite primary key, you might use code like this:

Cu cu = Cu.alloc(); // Get a temporary Cu from the pool.
cu.append(databaseId).append(relationName)
      .append(primaryKey1).append(primaryKey2)
      .append(columnId).append(value);
db.insert(cu);      // ‘db’ is some pre-existing ItemSpace
Cu.dispose(cu);     // Optionally return Cu to pool for maximum speed.

Sets and Multi-Valued Columns

The code above, shown under ‘Dynamic Inversions’, has a ‘while’ loop that allows multiple primary key values to be read in sequence. This is the same kind of code that would be used to access any kind of set of values, such as a set of primary keys for rows that have been flagged as being in a particular category, or a set of values for a multi-valued column. Sets of rows can be created and deleted dynamically simply by inventing unique set Id’s and storing or deleting Items like this:

 

 

 


Multi-valued columns are normally prohibited in relation theory, but are actually quite useful. Sometimes a column is considered at schema design time to be single-valued, but is later discovered to need to be multi-valued. This actually happened in the development of the BoilerBase email indexer application. A column containing the file offset of a message in an external mail file needed to become multi-valued and composite because it was realized that the same message could exist in more than one mail file. Some of the mail files might not be available at a particular time, so the multiple values had to be tried in sequence until a working one was found. The field value was simply stored multiple times by inserting multiple Items, one for each message pointer, where a message pointer became a composition of an offset and a mail file name. The rest of the Item components stayed the same.

The Entity-Attribute-Value Model

The relational model has been almost universally accepted for persistent structural data storage, so a new model has a considerable uphill climb to make before being embraced. However, the Entity-Attribute-Value model can be viewed as an extension of the relational model that is natural in an ItemSpace. It unifies the concepts of relation and index and allows multi-valued and composite columns. The resulting structure is simpler and more satisfying.

The term ‘relation’ is replaced by the term ‘EntityClass’. Each primary key value of each relation is considered to identify an ‘Entity’ rather than a row. The column id’s are called ‘Attribute Id’s and the cell values are still called Values. (This may not be considered a significant change in names but it seems helpful in practice.)  Data is stored in Items structured like this:

 

 

 


where an Entity can be composite as in relations, but Value’s can also be composite, unlike in relations. Databases contain EntityClasses, which contain simple or composite Enties, which contain Attributes, which contain simple or composite Values.  The fact that Values can be composite as well as Entities gives an EAV Item a certain symmetry. Also, the Values of any Attribute may in principle be multi-valued in the EAV model, although the semantics of a particular Attribute may not allow it. Relational table cells, on the other hand, always contain single, non-composite values. The capability of Attributes to be multi-valued combined with the possibility of composite Values allows relational indexes to be ‘folded’ into Entity Classes.

An Example EAV Database

Let’s examine the following example EAV database. The table below shows an EAV structure, one Item per row, with no composite Entities or Values. Composite Entities or Values could have been indicated by placing them in single table cells with some kind of separator between the simple values. This table format would not necessarily be visible directly to a user of the application, but would be interesting to the application author for debugging purposes in an “Item Editor”, which is a general-purpose direct ItemSpace browser and modifier.

 

Database Id

EntityClass Id

Entity

Attribute Id

Value

Chem

Student

Jenkins, Waldo

Student #

3451

 

 

 

GPA

3.58