The ItemSpace Data Model

InfinityDB is based on an ‘ItemSpace’ data model, with a unique minimal API that is elegantly simple and which is the basis for a rich conceptual stack of layers. Here we discuss the model in depth. Also see the basic ItemSpace example code.

Minimal Introduction

This section is the quickest possible introduction for programmers. Below, the dark screen is the set of ‘Items’ in the ItemSpace, each being apparently a line of text but stored sorted uniquely and encoded into binary as distinct components. These are accessed by prefixes that are given in a cursor called a ‘Cu’ in light blue. The ‘next()’ operation moves the cursor one Item forwards. The blue carat below the cursor is a ‘protected prefix length’, which cause next() to return false if the portion to the left would be modified, in which case nothing happens.

There are four such essential retrieval operations for >, <, >=, and <=, plus ‘mutator’ operations to insert, delete, delete suffixes, and update according to the contents of the Cu. Much more is built on these.

The Layers

The system can be dealt with at several levels of abstraction:

LayerDescription
ApplicationClient code, systems, and architecture;
AdapterVirtual views that abstract and indirectly manipulate ItemSpace structures such as ConcurrentNavigableMap, BinaryLongObjectOutputStream, BinaryLongObjectInputStream, CharacterLongObjectWriter, CharacterLongObjectReader, Index structures, and other views;
StructureThe logical combination of multiple Items to represent high-level concepts. These include tables, trees, sets, graphs, indexes, texts, lists, byte and character streams, images and more;
ComponentsItems are interpreted as containing a series of components of 12 data types, encoded packed using standard variable-length formats. Items can have variable numbers of components of mixed data types;
ItemSpaceA mutable ordered set of Items, which are uninterpreted, variable-length raw char sequences up to 1665 in length. An Item is a logical concept, like a number, and is immutable.

The Definition of ‘ItemSpace’

An ItemSpace represents a sorted set of ‘Items’, each Item being a fixed-length sequence of 0 to 1665 chars. An Item is a logical concept, like a number, and is immutable. In comparing Items, the initial chars  of the sequence are most significant, and shorter Items come first, and if the lengths are the same and corresponding chars are all equal, then the sequences signify the same Item. There is no other state. Individual Items can be inserted and deleted arbitrarily and retrieved one Item at a time bidirectionally in order or randomly based on a given nearest Item. A given Item occurs in a given ItemSpace at most once, and attempting to insert any pre-existing Item or deleting any already absent Item is legal and immediate but has no effect. Items can be represented outside an ItemSpace in various other ways, as binary or text.

ItemSpace API

ItemSpace is an abstract subclass of the abstract ItemOutput class. 

The ItemOutput class

There are four essential abstract mutation methods in ItemOutput. They do not return information about the results of the mutation, so that mutations may be transmitted without round-trip delays over a network, or when feedback is not available, such as into a temporary file or channel.

  • insert(item) – add an Item.  If the Item is already present, do nothing
  • delete(item) – remove an Item. If the Item is already absent, do nothing
  • deleteSubspace(prefixItem, boolean isDeletePrefix) – remove any and all pre-existing Items starting with prefixItem, and optionally the prefixItem itself.
  • update(insertedItem, int prefixLength, boolean isDeletePrefix)is an atomic:
    • deleteSubspace(insertedItemUpToPrefixLength, isDeletePrefix), followed by
    • insert(insertedItem)

The ItemSpace class

There are four essential abstract retrieval methods in ItemSpace. Each returns the Item in the ItemSpace nearest to a given Item. They return true iff a matching Item was found. A matching Item has the same chars as the given searchItem up to the given prefix length.

  • first(searchItem, int prefixLength) – get the nearestItem >= searchItem
  • next(searchItem, int prefixLength) – get the nearestItem > searchItem
  • last(searchItem, int prefixLength) – get the nearestItem <= searchItem
  • previous(searchItem, int prefixLength) – get the nearestItem < searchItem

Many convenience and default methods are provided by ItemOutput and ItemSpace based on the above. In many ItemSpaces, these operations are thread-safe, and often multi-core concurrent. For extreme speed, batches of Items can be read with nextItems(item, int prefixLength, byte[] buf, int off, int len) and written with writeItems(byte[] buf, int off, int len).  These use the fast ‘ItemPacket’ checksummed binary encoding for Items in buffers or streams. The ItemSpace default methods are often overridden for speed or for other reasons.

ItemSpace Implementations

ItemSpace implementations can ‘wrap’ or delegate to others, or provide special functionality. Some provide ‘views’ of others according to various rules. Many problems can be solved by simply composing ItemSpaces of various kinds. Another route is custom subclassing of ItemSpace, often in order to wrap instances of one or more other ItemSpaces. It is easy, however, to write applications using only the default InfinityDB database file ItemSpace, and none of those below.

These ItemSpaces are provided with InfinityDB:

  • InfinityDB database exposes its data as an ItemSpace, adding persistence, transactionality, concurrency,  and much more
  • AndItemSpace views the logical intersection of multiple given ItemSpaces
  • OrSpace views the logical union of multiple given ItemSpaces.
  • AndOrItemSpace is an optimized nested AndItemSpace and OrItemSpace with no nesting limit
  • IndirectItemSpace delegates to a given ItemSpace
  • ItemSubspace – views the suffixes of a given prefix of a given ItemSpace
  • RangeItemSpace – views a given ItemSpace over a fixed range of Items
  • VolatileItemSpace – is a simple in-memory ItemSpace, not concurrent or transactional
  • DeltaItemSpace – views a given static ItemSpace as if it were mutable, optionally applying mutations later
  • TransactionalItemSpace – handles transactions for InfinityDB
  • ReadOnlyItemSpace – throws Exception on mutations
  • EmptyItemSpace – always contains nothing
  • BufferedItemSpace – batches mostly-sequential operations, such as for remote access (Client/server)
  • RangeBufferedItemSpace – batches mostly-associated retrievals, such as for remote access (Client/server)
  • RemoteClientItemSpace – a transparently remote ItemSpace over the ‘ItemPacket’ protocol (Client/server)

The provided ItemOutputs are:

  • DiskBasedMergeSorterItemOutput – does time-efficient sorting of large datasets
  • ItemSpaceBasedMergeSorterItemOutput – does space-efficient sorting of large datasets
  • OutputStreamItemOutput – writes mutations to an OutputStream in an ‘ItemPacket’ format

There is also the independent ‘ItemInput’ class:

  • InputStreamItemInput – reads Items or batches of Items from an InputStream using an ‘ItemPacket’ format
  • ItemSpaceScannerItemInput – produces Items in sequence from a given ItemSpace under a given prefix

Dynamic Boolean Queries

An example of composition is a fast dynamic boolean-logic query view of any underlying wrapped ItemSpace. This requires only the appropriate compositions of the provided AndSpace and OrSpaces. The AndSpace uses an efficient ‘zig-zag’ algorithm, and it handles ‘negative’ inputs. Deep nestings can optimize and flatten themselves. The query needs no execution phase, but just ‘views’ the underlying ItemSpace, which may be dynamically mutated.

Component Encoding

When it is necessary to add semantics to an ItemSpace for applications, the Items are encoded to contain ‘components’. Each component represents a single data element in one of 12 types. Each component contains a type code, followed by its value in a variable-length format, with components packed in sequence in the Item with no wasted space at the start, internally or at the end. The encoding of components is permanently fixed, and has never changed. . There is no other structure in an Item beyond the sequence of components. Applications or users do not need to know the encoding. In many situations applications only deal with logical series’ of components, such as Object[]’s of components, that can be converted to actual encoded components in Items when needed. Each type has a unique standard corresponding immutable class to represent it, with the array types being represented by byte[] or char[] classes.

Basic Data Types

  • String – stored with terminating 0 and quoted internal 0’s.
  • Long  – stores all integral values, with smaller numbers taking less space. (handles byte, short, integer, char).
  • Float – the binary form is adjusted for proper sorting .
  • Double  – the binary form is adjusted for proper sorting .
  • Boolean (requires only one char – a ‘true’ and a ‘false’ type code)
  • Date – a millisecond time
  • char[] – a short array which must fit in an Item, with its length at the front
  • byte[] – a short array which must fit in an Item, with its length at the front
  • ByteString – like a string but binary for sorting purposes, with no initial length char
  • Index – a special kind of long that is used for ‘vertical’ array-like structures i.e. lists
  • EntityClass – a ‘non-primitive’ ‘metadata’ component that describes the semantics
  • Attribute – a ‘non-primitive’ ‘metadata’ component that describes the semantics

Multi-Item Structures

Sets of Items are used to represent arbitrarily large structures. Any such structure can be identified by its prefix, which is just the series of components that lead up to it: the structure itself is represented by the set of suffixes of Items having that prefix. Such a set of suffixes can be considered an ItemSpace in its own right, and can be called a ‘subspace’. Many of the structures are containers that can nest arbitrarily complex structures of any kind and size. All are just agreed-on patterns in the Items.  Hence, further structures can be transparently added by creative applications. Many structures can be looked at as various types at different times, as is convenient . BLOB/CLOB, Maps, Indexes and more have extensive support code.

Common Structures

  • BLOB’s or CLOB’s: ‘Binary Long OBject’ or ‘Character Long OBject’ are structured as an Index component determining a particular block, followed by a block in a byte[] or char[] component of up to 1024 elements.
  • Texts: an Index component designating a line number, followed by a String. The String is typically a ‘paragraph’ with up to about 1K chars. Tables, other texts, or any structure  can be nested.
  • Lists:  an Index component indicating the position of the elements in a ‘vertical’ logical array. These can be arbitrarily sparse and long. An element is determined by the set of suffixes after the index component, so elements can have any structure. Index components are represented by lists when in JSON.
  • Tables:  EntityClass and Attribute ‘metadata’ data types are used. Value ‘cells’ can nest any structure.
  • Trees:  a variable number of consecutive components represent a path in the tree. ‘Directories’ use Strings. The ‘leaf’ of each path is the set of suffixes, and can have any structure.
  • Maps:  divide the components in each Item such that at the left the components represent keys, and towards the right they are values.
  • Sets:  all of the suffixes of  Items having a given prefix can be considered elements of the set
  • MultiMaps: like a Map, but with values that use the Set structure above for multiple values per key. They can also nest any other structure
  • Tuples: a series of consecutive primitive components. Tuples of diverse lengths and component types can be mixed. Multiple tuples can occur within an Item depending on context, such as by being delimited by EntityClass and Attribute meta components. A zero-length tuple is allowed.
  • Indexes: a ‘sort/merge’ style structure allows efficient index creation with optional incremental dynamic update (See IncrementalMergingItemSpace and the other sorts.)

Entity-Attribute-Value Structures 

The EntityClass and Attribute ‘non-primitive’ components provide an optional elegant way to create very rich flexible structures as explained in InfinityDB Client/Server Java NoSQL Database. There, you will find a description and graphical views of the structures as can be seen in the server back-end database browser. These two special components allow schema metadata to be embedded in the Item rather than maintained externally. This allows a schema to be effectively extended at the moment an Item with a new structure enters or becomes visible in an ItemSpace, or likewise to have the schema effectively restored when an Item leaves or becomes invisible. In other words, the  schema provides ‘late binding’ of structures – in fact as late as possible. Such late binding is unique in database systems, and it therefore addresses a unique set of possible applications.  EntityClass and Attribute components are similar to longs or Strings but are encoded with a special initial identifier char. These meta components can be paired or grouped within any Item to logically describe ‘tables’, ‘sub-tables’, ‘sub-attributes’, or ‘nested tables’. Values can have any structure. This is effectively a Class-Entity-Attribute-Value model, with nesting.

 Adapters

The low-level ItemSpace view can be hidden and can be indirectly accessed via higher-level views.

Maps

All ItemSpace operations can be indirectly invoked via an optional Map model, based on an extended java.util.concurrent.ConcurrentNavigableMap – the most general and capable Map. One just constructs an InfinityDBMap or InfinityDBSet to wrap any ItemSpace under any prefix. There are many extensions. See InfinityDBMap Access and InfinityDBMap Example code. This access method is extremely powerful, and many applications make extensive, or even almost exclusive use of it. For example, the entire InfinityDB Client/Server backend database browser web site uses these Maps, as does its PatternQuery. The Map interface is dynamically interchangeable with the ItemSpace, and the two can co-exist in any application in any mixture.  They have exactly the same effective underlying ItemSpace mutation and retrieval capabilities. Maps are thread-safe and in fact multi-core concurrent.

Binary Long Objects and Character Long Objects

BLOB/CLOB access is simple, providing a familiar InputStream/OutputStream or Reader/Writer view. The utility classes BinaryLongObjectOutputStream, BinaryLongObjectInputStream, CharacterLongObjectWriter and CharacterLongObjectReader provide this. Images or files can easily be embedded in an ItemSpace. This helper code only manipulates Items in the underlying ItemSpace without using any outside storage. The structures can be placed under any component sequence prefix, hence can be nested in any outer structure as a set of suffixes and dealt with transparently by other code. BLOBs use the Index and byte[] types, while CLOB’s use Index and char[] components at the ends of Items. Index types are essentially a long constant with a special initial code, and the long identifies the position of each array component in the logical file, so BLOBs and CLOB’s are logically almost unlimited in size, yet use only the necessary space.  Any number of long objects may be written or read by different threads, but each one is single-threaded. As with all other structures, they are stored compressed.

Index Structures

Index structures are for things like text word lookup and other large-data purposes. The IncrementalMergingItemSpace code simply writes Items in a special structure in the underlying ItemSpace, while providing a virtual view to the client of a normal ItemSpace through which data to be indexed is easily written or retrieved.  Thus there is no complex special API, and any type of data can be stored transparently in the index structure. An efficient ‘sort/merge’ structure is used, so writes and reads are always efficient as the index grows without inherent limits, even beyond the block cache size. The structure is optimized for insert, even for non-localized Items, while delete is  slower but with consistent reasonable performance. The work on maintaining the index is done a little bit on each insert or delete.  Both inserts and deletes are consistent and visible immediately. Retrievals can be completely random,  with performance characteristics like the underlying ItemSpace but from 1 to 30 times slower. The index may be incrementally optimized. There are no temporary states of index sorting to cause delays, so the index is available and fast at all times, yet always space efficient. The index is single-threaded while being updated, but multi-threaded while only being read. Localized reads are faster. Interruptions or abrupt application exit leaves a consistent usable structure. As with all structures, an unlimited number of indexes may be identified by  component prefixes, may be nested inside other structures, and may be transparently handled by other code. Indexes are always stored compressed.

Interface

Sets of Items can be formatted, stored and communicated in various ways:

  • In InfinityDB database files: Items are prefix-compressed, packed, UTF-8 and ZLib-compressed, in variable-length blocks.
  • As JSON: any set of Items defined by a common prefix can be formatted and parsed as JSON. There are extensions to JSON to handle all of the 12 data types, plus an ‘underscore-quoted’ format to adapt to standard JSON syntax. Index components designate JSON arrays.
  • As ‘ItemPackets’:  a fast, checksummed binary stream of Items expressed as ItemOutput mutation operations for communication or buffering.
  • Tokenized: each component type has a default easily human-readable text format based on JSON or Java
  • As URL-quoted tokens: for RESTful access via Python, curl, or browser in the Client/Server edition. See this Picture (user ‘testUser’ and password  ‘db’) or this JSON
  • In an Object[], with elements being immutable Objects of classes appropriate to the component types
  • In a Cu ‘cursor’ in a private char[1665] with a private current length. This is used by the basic ItemSpace operations.

The Cu Class

A Cu ‘cursor’ instance holds one Item, for temporary manipulation, and it has no other state. The ItemSpace methods pass Items in and out via Cu references. The contained Item can be appended-to, parsed, truncated, and more. A Cu is allocated very quickly with Cu.alloc() and returned to the pool by Cu.dispose(), often for only a few operations.  A Cu can be thought of as a char[] of length 1665 with a current length. An example:

try (Cu cu = Cu.alloc()) {
    // a trivial Item - in practice it would be at least a few components
    cu.append("hello world");
    // if not already there, add the Item to the database, which is an ItemSpace
    db.insert(cu);
}

Appending to a Cu

Various things can be appended with Cu.append(…):

  • null, which does nothing
  • A primitive like long, int, short, byte, double, float, boolean (integrals become longs)
  • A primitive wrapper like Long, Double, Float, Boolean, String, or Date
  • A short array like byte[] or char[], which becomes a single component of array type
  • An array of Objects, which is appended by appending its elements according to these rules
  • A CuAppendable implementer, which provides a custom appendTo(Cu)
  • A Cu, either completely or after an offset or in a slice
  • Instances of ByteString, Index, EntityClass, or Attribute

Or, for special cases: appendXXX(…)

  • appendTime(long milliseconds), to append a Date
  • appendIndex(long index) to append an Index
  • appendDirect(…) for special extremely high speed purposes.

Parsing a Cu

Given an offset into a Cu, components can be extracted or skipped:

  • A single component of a known expected primitive type with Cu.typeAt(int off), or  Cu.skipType(int off). E.g. longAt(off).
  • A single component of a known expected InfinityDB class with Cu.TypeAt(int off). E.g. StringAt(off)
  • A single component of any InfinityDB class with Cu.componentAt(int off), or Cu.skipComponent(int off).
  • An Object[], of any InfinityDB primitive classes with Cu.compositeAt(int off), or Cu.skipComposite(int off)
  • An Object[], of any InfinityDB component classes  Cu.toComponentArray(int off)

In the above, an ‘InfinityDB class’ includes the primitive wrappers such as the standard Long, Double, and so on, but also String, Date, and the InfinityDB-specific classes ByteString, Index, EntityClass, and Attribute. A ‘primitive’ in InfinityDB is any component class but EntityClass and Attribute. ‘EntityClass’ is the InfinityDB Java class representing a conceptual class in the database. An example:

try (Cu cu = Cu.alloc()) {
    // create an example Item in cu. It could have been in an ItemSpace
    cu.append("hello").append(5).append(new Date());
    // Find the number 5. Its position is off chars from the start
    int off = cu.skipComponent(0);
    // get the 5. If it is not a long, throw an Exception
    long n = cu.longAt(off);
    // get the 5 without knowing that it is a long
    Object n = cu.componentAt(off);
    // get all of the components as an array
    Object[] components = cu.toComponentArray(0);
}

For more on the ItemSpace, see the Basic ItemSpace Example Code and see the original Manual. (Note that the component printed representation or ‘tokenized’ format of Dates has been changed to the ISO standard.) Also feel free to email support@boilerbay.com.

Abstract Discussion

Do you want to implement an ItemSpace by subclassing ItemSpace? Here are some suggestions.

Expected ItemSpace Characteristics

  • There is no pre-configuration for specific Item sizes, so any Item of any length can be inserted into or deleted from any mutable ItemSpace dynamically at any time.
  • Any ItemSpace can efficiently contain any mixture of Items including Items of any mixture of different lengths.
  • The sorted state is maintained at all times with no significant variable or unpredictable delay after mutation, either for further mutation or for retrieval.
  • Insertions and deletions may be mixed in any order  with Items in any order or mixture of lengths with no significant performance loss or variability.
  • Mutations are effectively atomic and immediate, except that uncertainties resulting from concurrency between Threads may slightly re-order the operations, in the absence of transactionality. 
  • Each mutation and each retrieval always succeeds, barring significant failures like out-of-space. (However, transactions can occasionally throw OptimisticLockConflictException).

Ideal ItemSpace Characteristics

There should be no fixed per-Item storage allocation, and ideally there is no need for configuration or tuning to adapt to particular situations such as different initially expected or dynamically changing length statistics.  ItemSpaces should operate well and correctly indefinitely, ideally with no interruptions, management, human attendance, or unlimited growing resource requirements.  Persistence is good but not required. The time for each basic operation is relatively constant, typically O(log(n)).  Items are normally stored with common prefixes compressed out for speed and space efficiency, because it is common to have many Items differing only near their ends. It is common for nearby Items – especially adjacent Items – to be mutated or retrieved close together in time, and such locality is typically taken advantage of with caches. Thread-safety and multi-core concurrency are important but not required. Mutability is valuable but not required.