The ItemSpace Data Model

If you are familiar with document databases using JSON, you will already know how flexible they are. InfinityDB can represent such documents and more, but instead of storing limited-size blobs of text, it stores unlimited sets of ‘Items’ which are strongly-typed ‘paths’ to the values. Groups of Items can also represent images, raw data, tables, trees, tuples, and other non-JSON structures. Here we discuss the model in depth, initially with a general definition but later in much more detail for software developers. Also see the definition of the ‘I-Language’ and the token forms, plus basic ItemSpace example code.

The Definition of ‘ItemSpace’

An ItemSpace represents a sorted set of ‘Items’ and there is no other state. An Item is a logical concept, like a number, and is immutable, meaning that if you change part of it it designates a different Item. An Item is a sequence of a few ‘components’ such as strings, numbers and 10 more types and they are variable in length. Here is the “tokenized” text form of an Item that connects a company to its address:

Company "Boiler Bay" location "US" "CA" "Soquel" "42 some street"

Individual Items can be inserted and deleted arbitrarily and retrieved one Item at a time bidirectionally in order or randomly based on a given nearest Item. A given Item occurs in a given ItemSpace at most once, and attempting to insert any pre-existing Item or to delete any already absent Item is legal and immediate but has no effect. An ItemSpace can be imagined to be represented as a single extended JSON, XML, or other hierarchical document, but quickly accessible at any level of detail and at any scale ranging over 13 orders of magnitude or more. In that form, each Item would represent the ‘path’ to a value in the document.

The meaning of the Item is primarily determined by the class and attribute components like Company and location, which are ‘meta components’. Complex structures use prefix sharing and mixtures of meta components within an Item. There is not necessarily more stored information about particular meta components, so they are ‘standalone’ and exist only to the extent that they are in Items in the ItemSpace. This particular Item has several adjacent non-meta components forming a ‘tuple’.

Representation

Items can be represented outside an ItemSpace in various ways, as graphics or text, although they are stored efficiently as common-prefix-compressed binary at the bottom level. For example, an Item can be in the ‘tokenized’ text form, taking one line of text, which is how it shows up in the database browser’s ‘current prefix’ text area for the user to see and edit as it tracks navigation through the database. Tokens are like “abc”, 5, 5.0, true and so on. In the i-code, JSON, CSV and other formats, the token forms also appear.

Item Components

The Item is divided up into a series of a few ‘components’ – such as zero to about 30 – each of which has one of 12 available data types. All of the types are defined to be comparable to any of the others, so every possible Item can be compared to any other possible Item without error. If two Items have a common prefix, then the components at the start of the suffixes determine the ordering, no matter what the types of the components. All of the data types that correspond to Java, JSON or JavaScript are represented the same way as in those languages. The types are, in order of the way they sort:

Type NameMeaningExamples in ‘Token’ Form
ClassA ‘Meta’ type use for a kind of ‘punctuation’ to separate the non-meta types, which are the ‘primitives’. Think of it as delimiting ‘tables’, although it is much more flexible. An ‘entity’ follows it. A capital letter, then letters, digits, dots, dashes and underscores. A historical, rarely seen form is numeric, like EntityClass(731).AlgaeImages, PublishedDocuments, Log, Scheduled_meetings, Scheduled.meetings, Scheduled-meetings
AttributeThe other ‘Meta’ type. Think of it as delimiting columns in a table, although it is much more flexible. A ‘value’ follows it. A lower case letter, then letters, digits, dots, dashes and underscores. A historical, rarely seen form is numeric, like Attribute(9328).logLocation, meeting_time, sample.size, temp-C, temp-F, address
StringCharacter sequence 0 to 1024 chars. Similar to Java, JavaScript, or JSON, but single quotes are not allowed as outside delimiters.“Hello World”, “Special chars \r\n \t”
Booleantrue or falsetrue, false
Float32-bit real number like double, always containing a dot but prints with a trailing ‘f’. IEEE-754 binary format. Not present in JavaScript, JSON, or Python. In JSON these are ‘underscore-quoted’ like “_5.3f”5.3f, 1.2E99f, 6.0f
Double64-bit real number. There is always a dot, even if it has a 0 fractional part. IEEE-754 binary format. Format like Java, JavaScript, and JSON. JavaScript has only this type. Unlike Python, the exponent defaults to upper case.5.3, 1.2E-99, 6.0
Long64-bit integer. Format like Java, JavaScript, or JSON. However, JavaScript and JSON have no 64-bit integer, only double, so JSON ‘underscore-quotes’ it like “_500”.0, 1, 500
DateDate and time, to 1ms since 1/1/1970 midnight GMT. Printed in ISO format, ending in timezone 07:00 or Z. A semi-standard milliseconds integer may be included after a dot after the seconds. The value is in the standard form of milliseconds starting at 1970-01-01T00:00:00Z.2018-03-02T16:00:09+00:00
BytesA sequence of 0 to 1024 bytes, which sort by their length, then by the bytes. Printed in capital hex. Used in BLOBs to contain the data. Stored as compressed binary.Bytes(A6_19_44)
ByteStringA sequence of 0 to 1024 bytes like Bytes but sort like strings, on the bytes themselves. Printed in capital hex. Stored as compressed binary.ByteString(A6_19_44)
CharsA sequence of 0 to 1024 16-bit chars, like string but sort on the initial length. Used in CLOBS. Stored as compressed binary.Chars(“Hello World”)
IndexLike a ‘long’ but has a special meaning to delimit elements of a list, where the elements are the sets of suffixes with the same prefix. Lists may be very long and the elements may be complex or large. Used for BLOBs and CLOBs and more.[0], [1], [39155164]

Blob Encoding

Blobs, like images, pdfs and any other file, are stored by breaking them into Items so they blend in with other data. The stored blob has a type and a list of byte array components. Within the database, blobs are tightly compressed. This structure is hidden in practice, such as in PatternQueries, where there is just a single ‘item’ type symbol to match it entirely; also the REST interface can optionally transfer single blobs directly as raw bytes at high speed. In Token form the Items for a blob look like:

Image “Oregon sea stacks” com.infinitydb.blob com.infinitydb.blob.data [0] Bytes(A6_44_38_….)
Image “Oregon sea stacks” com.infinitydb.blob com.infinitydb.blob.data [1] Bytes(41_23_9F_….)
Image “Oregon sea stacks” com.infinitydb.blob com.infinitydb.blob.mimeType “image/jpeg”

Underscore Quoted JSON

It is easy to embed Items into JSON for REST access: each Item represents a ‘path’. Just put an underscore before individual Tokens and surround each with double quotes to create each key. It is therefore possible to use any data type as a key or value, not just strings. String components do not use the underscore but if a string intentionally already starts with underscore, just ‘stuff’ another in front of it. When an Item contains an Index component whose token is like [0] or [99] it designates a list.

Long Items are intentional but not required, and they have many advantages in performance and semantics – see the examples. The i-code format is much easier to read and edit with longer Items, and in fact the JSON form is primarily used in the REST communication protocol.

Note that the keys and values are always sorted when JSON is printed, but may be in any order for parsing. On the client side, JSON may be parsed by the helper code and become JavaScript Objects, Python dicts, or custom Java Map-based classes. For more detail see REST APIs with PatternQueries. Here we have one Item as JSON, with the underscores before the class _Company and the attribute _location.

{
   "_Company" : {
        "Boiler Bay" : {
            "_location" : {
                 "US" : {
                      "CA" : {
                           "Soquel" : {
                                "42 some street" : null
                            }
                      }
                 }
             }
         }
    }
}       

The Component Binary Encoding

Most users only will need the description above, but as background here we discuss the storage level, which is provided by InfinityDB Embedded or InfinityDB Encrypted.

Binary Form

Users never need to deal with the fact that at bottom, an Item is stored as a packed sequence of chars, but instead users see Items formatted in some more meaningful way. The components are encoded in binary into variable-length sequences of chars from 0 to 1665 chars in length, packed and variable length for speed and simplicity. The simplicity of this bottom-level representation increases speed and compression considerably. In the memory cache, these are sorted and prefix-compressed.

Compression

The database file compresses these by sorting, removing common-prefixes, doing zlib compression, and packing into variable-length blocks. The compression ratio is from 1 to about 10x, but can be more for specific cases. For example, a genome database stores a sequence of A, C, T, or G ‘base pairs’, and these can be compressed down to about 1 byte even though the prefix is arbitrarily long, ending in a position number and a string with just one letter.

Component Formats Fixed

The formats of the component encodings are permanent, since changing them would make older databases incompatible with newer ones. No new data types will be added, so databases are always forwards and backwards compatible with software versions. This has been true since 2002 except that named classes and attributes were introduced with the InfinityDB Server – previously they were numbered. So the original version 4 InfinityDB Embedded from pre-2015 is not forwards compatible with databases created by post-2015 if named classes and attributes are stored, but now version 4, which is InfinityDB Embedded is compatible and is a drop-in replacement. Backwards compatibility has never been affected. For universal compatibility, just avoid named classes and attributes.

Minimal Introduction to the API

This section is the quickest possible introduction for programmers. (Note that when accessing InfinityDB Embedded in practice programmatically, the ConcurrentNavigableMap API is an important higher-level alternative that wraps an ItemSpace. The server uses it internally almost exclusively.) Below, the dark screen is the set of ‘Items’ in the ItemSpace, each being apparently a line of text but stored sorted uniquely and encoded into binary as distinct components. These are accessed by prefixes that are given in a cursor called a ‘Cu’ in light blue. The ‘next()’ operation moves the cursor one Item forwards. The blue carat below the cursor is a ‘protected prefix length’, which cause next() to return false if the portion to the left would be modified, in which case nothing happens.

There are four such essential retrieval operations for >, <, >=, and <=, plus ‘mutator’ operations to insert, delete, delete suffixes, and update according to the contents of the Cu. Much more is built on these.

The Layers

The system can be dealt with at several levels of abstraction:

LayerDescription
ApplicationClient code, systems, and architecture;
AdapterVirtual views that abstract and indirectly manipulate ItemSpace structures such as InfinityDBMap (which implements java.util.concurrent.ConcurrentNavigableMap), BinaryLongObjectOutputStream, BinaryLongObjectInputStream, CharacterLongObjectWriter, CharacterLongObjectReader, and other views;
StructureThe logical combination of multiple Items to represent high-level concepts. These include tables, trees, sets, graphs, indexes, texts, lists, byte and character streams, images and more;
ComponentsItems are interpreted as containing a series of components of 12 data types, encoded packed using standard variable-length formats. Items can have variable numbers of components of mixed data types;
ItemSpaceA mutable ordered set of Items, which are uninterpreted, variable-length raw char sequences up to 1665 in length. An Item is a logical concept, like a number, and is immutable.

ItemSpace API

ItemSpace is an abstract subclass of the abstract ItemOutput class. 

The ItemOutput class

There are four essential abstract mutation methods in ItemOutput. They do not return information about the results of the mutation, so that mutations may be transmitted without round-trip delays over a network, or when feedback is not available, such as into a temporary file or channel.

  • insert(item) – add an Item.  If the Item is already present, do nothing
  • delete(item) – remove an Item. If the Item is already absent, do nothing
  • deleteSubspace(prefixItem, boolean isDeletePrefix) – remove any and all pre-existing Items starting with prefixItem, and optionally the prefixItem itself.
  • update(insertedItem, int prefixLength, boolean isDeletePrefix)is an atomic: deleteSubspace(insertedItemUpToPrefixLength, isDeletePrefix), followed by insert(insertedItem).

The ItemSpace class

There are four essential abstract retrieval methods in ItemSpace. Each returns the Item in the ItemSpace nearest to a given Item. They return true iff a matching Item was found. A matching Item has the same chars as the given searchItem up to the given prefix length.

  • first(searchItem, int prefixLength) – get the nearestItem >= searchItem
  • next(searchItem, int prefixLength) – get the nearestItem > searchItem
  • last(searchItem, int prefixLength) – get the nearestItem <= searchItem
  • previous(searchItem, int prefixLength) – get the nearestItem < searchItem

Many convenience and default methods are provided by ItemOutput and ItemSpace based on the above. In many ItemSpaces, these operations are thread-safe, and often multi-core concurrent. For extreme speed, batches of Items can be read with nextItems(item, int prefixLength, byte[] buf, int off, int len) and written with writeItems(byte[] buf, int off, int len).  These use the fast ‘ItemPacket’ checksummed binary encoding for Items in buffers or streams and are implemented by default using the basic next(). The ItemSpace default methods are often overridden for speed or for other reasons.

ItemSpace Implementations

ItemSpace implementations can ‘wrap’ or delegate to others, or provide special functionality. Some provide ‘views’ of others according to various rules. Many problems can be solved by simply composing ItemSpaces of various kinds. Another route is custom subclassing of ItemSpace, often in order to wrap instances of one or more other ItemSpaces. It is easy, however, to write applications using only the default InfinityDB database file ItemSpace, and none of those below.

These ItemSpaces are provided with InfinityDB:

  • InfinityDB database exposes its data as an ItemSpace, adding persistence, transactionality, concurrency,  and much more
  • AndItemSpace views the logical intersection of multiple given ItemSpaces
  • OrSpace views the logical union of multiple given ItemSpaces.
  • AndOrItemSpace is an optimized nested AndItemSpace and OrItemSpace with no nesting limit
  • IndirectItemSpace delegates to a given ItemSpace
  • ItemSubspace – views the suffixes of a given prefix of a given ItemSpace
  • RangeItemSpace – views a given ItemSpace over a fixed range of Items
  • VolatileItemSpace – is a simple in-memory ItemSpace, not concurrent or transactional
  • DeltaItemSpace – views a given static ItemSpace as if it were mutable, optionally applying mutations later
  • TransactionalItemSpace – handles transactions for InfinityDB
  • ReadOnlyItemSpace – throws Exception on mutations
  • EmptyItemSpace – always contains nothing
  • BufferedItemSpace – batches mostly-sequential operations, such as for remote access (Client/server)
  • RangeBufferedItemSpace – batches mostly-associated retrievals, such as for remote access (Client/server)
  • RemoteClientItemSpace – a transparently remote ItemSpace over the ‘ItemPacket’ protocol (Client/server)

The provided ItemOutputs are:

  • DiskBasedMergeSorterItemOutput – does time-efficient sorting of large datasets
  • ItemSpaceBasedMergeSorterItemOutput – does space-efficient sorting of large datasets
  • OutputStreamItemOutput – writes mutations to an OutputStream in an ‘ItemPacket’ format

There is also the independent ‘ItemInput’ class:

  • InputStreamItemInput – reads Items or batches of Items from an InputStream using an ‘ItemPacket’ format
  • ItemSpaceScannerItemInput – produces Items in sequence from a given ItemSpace under a given prefix

Dynamic Boolean Queries

An example of composition is a fast dynamic boolean-logic query view of any underlying wrapped ItemSpace. This requires only the appropriate compositions of the provided AndSpace and OrSpaces. The AndSpace uses an efficient ‘zig-zag’ algorithm, and it handles ‘negative’ inputs. Deep nestings can optimize and flatten themselves. The query needs no execution phase, but just ‘views’ the underlying ItemSpace, which may be dynamically mutated.

Component Encoding

When it is necessary to add semantics to an ItemSpace for applications, the Items are encoded to contain ‘components’. Each component represents a single data element in one of 12 types. Each component contains a type code, followed by its value in a variable-length format, with components packed in sequence in the Item with no wasted space at the start, internally or at the end. The encoding of components is permanently fixed, and has never changed. . There is no other structure in an Item beyond the sequence of components. Applications or users do not need to know the encoding. In many situations applications only deal with logical series’ of components, such as Object[]’s of components, that can be converted to actual encoded components in Items when needed. Each type has a unique standard corresponding immutable class to represent it, with the array types being represented by byte[] or char[] classes.

Basic Data Types

  • EntityClass – a ‘non-primitive’ ‘metadata’ component that describes the semantics
  • Attribute – a ‘non-primitive’ ‘metadata’ component that describes the semantics
  • String – stored with terminating 0 and quoted internal 0’s.
  • Boolean (requires only one char – a ‘true’ and a ‘false’ type code)
  • Float – the binary form is adjusted for proper sorting .
  • Double  – the binary form is adjusted for proper sorting .
  • Long  – stores all integral values, with -10..100 taking one char. (replaces byte, short, integer, char).
  • Date – a millisecond time as a long
  • byte[] – a short array which must fit in an Item, with its length at the front
  • ByteString – like a string but binary for sorting purposes, with no initial length char
  • char[] – a short array which must fit in an Item, with its length at the front
  • Index – a special kind of long that is used for ‘vertical’ array-like structures i.e. lists

Multi-Item Structures

Sets of Items are used to represent arbitrarily large structures. Any such structure can be identified by its prefix, which is just the series of components that lead up to it: the structure itself is represented by the set of suffixes of Items having that prefix. Such a set of suffixes can be considered an ItemSpace in its own right, and can be called a ‘subspace’. Many of the structures are containers that can nest arbitrarily complex structures of any kind and size. All are just agreed-on patterns in the Items.  Hence, further structures can be transparently added by creative applications. Many structures can be looked at as various types at different times, as is convenient . BLOB/CLOB, Maps, Indexes and more have extensive support code.

Common Structures

  • BLOB’s or CLOB’s: ‘Binary Long OBject’ or ‘Character Long OBject’ are structured as an Index component determining a particular block, followed by a block in a byte[] or char[] component of up to 1024 elements.
  • Texts: an Index component designating a line number, followed by a String. The String is typically a ‘paragraph’ with up to about 1K chars. Tables, other texts, or any structure  can be nested.
  • Lists:  an Index component indicating the position of the elements in a ‘vertical’ logical array. These can be arbitrarily sparse and long. An element is determined by the set of suffixes after the index component, so elements can have any structure. Index components are represented by lists when in JSON.
  • Tables:  EntityClass and Attribute ‘metadata’ data types are used. Value ‘cells’ can nest any structure.
  • Trees:  a variable number of consecutive components represent a path in the tree. ‘Directories’ use Strings. The ‘leaf’ of each path is the set of suffixes, and can have any structure.
  • Maps:  divide the components in each Item such that at the left the components represent keys, and towards the right they are values.
  • Sets:  all of the suffixes of  Items having a given prefix can be considered elements of the set
  • MultiMaps: like a Map, but with values that use the Set structure above for multiple values per key. They can also nest any other structure
  • Tuples: a series of consecutive primitive components within an Item. Tuples of diverse lengths and component types can be mixed. Multiple tuples can occur within an Item depending on context, such as by being delimited by EntityClass and Attribute meta components. A zero-length tuple is allowed.
  • Indexes: a ‘sort/merge’ style structure allows efficient index creation with optional incremental dynamic update (See IncrementalMergingItemSpace and the other sorts.)

Entity-Attribute-Value Structures 

The Class and Attribute ‘non-primitive’ components provide an optional elegant way to create very rich flexible structures as explained in InfinityDB Client/Server Java NoSQL Database. There, you will find a description and graphical views of the structures as can be seen in the server back-end database browser. These two special components allow schema metadata to be embedded in the Item rather than maintained externally. This allows a schema to be effectively extended at the moment an Item with a new structure enters or becomes visible in an ItemSpace, or likewise to have the schema effectively restored when an Item leaves or becomes invisible. In other words, the  schema provides ‘late binding’ of structures – in fact as late as possible. Such late binding is unique in database systems, and it therefore addresses a unique set of possible applications.  Class and Attribute components are similar to longs or Strings but are encoded with a special initial identifier char. These meta components can be paired or grouped within any Item to logically describe ‘tables’, ‘sub-tables’, ‘sub-attributes’, or ‘nested tables’. Values can have any structure. This is effectively a Class-Entity-Attribute-Value model, with nesting.

 Adapters

The low-level ItemSpace view can be hidden and can be indirectly accessed via higher-level views.

Maps

All ItemSpace operations can be indirectly invoked via an optional Map model, based on an extended java.util.concurrent.ConcurrentNavigableMap – the most general and capable Map. One just constructs an InfinityDBMap or InfinityDBSet to wrap any ItemSpace under any prefix. There are many extensions. See InfinityDBMap Access and InfinityDBMap Example code. This access method is extremely powerful, and many applications make extensive, or even almost exclusive use of it. For example, the entire InfinityDB Client/Server backend database browser web site uses these Maps, as does its PatternQuery. The Map interface is dynamically interchangeable with the ItemSpace, and the two can co-exist in any application in any mixture.  They have exactly the same effective underlying ItemSpace mutation and retrieval capabilities. Maps are thread-safe and in fact multi-core concurrent.

Binary Long Objects and Character Long Objects

BLOB/CLOB access is simple, providing a familiar InputStream/OutputStream or Reader/Writer view. The utility classes BinaryLongObjectOutputStream, BinaryLongObjectInputStream, CharacterLongObjectWriter and CharacterLongObjectReader provide this. Images or files can easily be embedded in an ItemSpace. This helper code only manipulates Items in the underlying ItemSpace without using any outside storage. The structures can be placed under any component sequence prefix, hence can be nested in any outer structure as a set of suffixes and dealt with transparently by other code. BLOBs use the Index and byte[] types, while CLOB’s use Index and char[] components at the ends of Items. Index types are essentially a long constant with a special initial code, and the long identifies the position of each array component in the logical file, so BLOBs and CLOB’s are logically almost unlimited in size, yet use only the necessary space.  Any number of long objects may be written or read by different threads, but each one is single-threaded. As with all other structures, they are stored compressed.

Index Structures

Index structures are for things like text word lookup and other large-data purposes. The IncrementalMergingItemSpace code simply writes Items in a special structure in the underlying ItemSpace, while providing a virtual view to the client of a normal ItemSpace through which data to be indexed is easily written or retrieved.  Thus there is no complex special API, and any type of data can be stored transparently in the index structure. An efficient ‘sort/merge’ structure is used, so writes and reads are always efficient as the index grows without inherent limits, even beyond the block cache size. The structure is optimized for insert, even for non-localized Items, while delete is  slower but with consistent reasonable performance. The work on maintaining the index is done a little bit on each insert or delete.  Both inserts and deletes are consistent and visible immediately. Retrievals can be completely random,  with performance characteristics like the underlying ItemSpace but from 1 to 30 times slower. The index may be incrementally optimized. There are no temporary states of index sorting to cause delays, so the index is available and fast at all times, yet always space efficient. The index is single-threaded while being updated, but multi-threaded while only being read. Localized reads are faster. Interruptions or abrupt application exit leaves a consistent usable structure. As with all structures, an unlimited number of indexes may be identified by  component prefixes, may be nested inside other structures, and may be transparently handled by other code. Indexes are always stored compressed.

Interface

Sets of Items can be formatted, stored and communicated in various ways:

  • In InfinityDB database files: Items are prefix-compressed, packed, UTF-8 and ZLib-compressed, in variable-length blocks.
  • As JSON: any set of Items defined by a common prefix can be formatted and parsed as JSON. There are extensions to JSON to handle all of the 12 data types, plus an ‘underscore-quoted’ format to adapt to standard JSON syntax. Index components designate JSON arrays.
  • As ‘ItemPackets’:  a fast, checksummed binary stream of Items expressed as ItemOutput mutation operations for communication or buffering.
  • Tokenized: each component type has a default easily human-readable text format based on JSON or Java
  • As URL-quoted tokens: for RESTful access via Python, curl, or browser in the Client/Server edition. See this Picture (user ‘testUser’ and password  ‘db’) or this JSON
  • In an Object[], with elements being immutable Objects of classes appropriate to the component types
  • In a Cu ‘cursor’ in a private char[1665] with a private current length. This is used by the basic ItemSpace operations.

The Cu Class

A Cu ‘cursor’ instance holds one Item, for temporary manipulation, and it has no other state. The ItemSpace methods pass Items in and out via Cu references. The contained Item can be appended-to, parsed, truncated, and more. A Cu is allocated very quickly with Cu.alloc() and returned to the pool by Cu.dispose(), often for only a few operations.  A Cu can be thought of as a char[] of length 1665 with a current length. An example:

try (Cu cu = Cu.alloc()) {
    // a trivial Item - in practice it would be at least a few components
    cu.append("hello world");
    // if not already there, add the Item to the database, which is an ItemSpace
    db.insert(cu);
}

Appending to a Cu

Various things can be appended with Cu.append(…):

  • null, which does nothing
  • A primitive like long, int, short, byte, double, float, boolean (integrals become longs)
  • A primitive wrapper like Long, Double, Float, Boolean, String, or Date
  • A short array like byte[] or char[], which becomes a single component of array type
  • An array of Objects, which is appended by appending its elements according to these rules
  • A CuAppendable implementer, which provides a custom appendTo(Cu)
  • A Cu, either completely or after an offset or in a slice
  • Instances of ByteString, Index,Class, or Attribute

Or, for special cases: appendXXX(…)

  • appendTime(long milliseconds), to append a Date
  • appendIndex(long index) to append an Index
  • appendDirect(…) for special extremely high speed purposes.

Parsing a Cu

Given an offset into a Cu, components can be extracted or skipped:

  • A single component of a known expected primitive type with Cu.typeAt(int off), or  Cu.skipType(int off). E.g. longAt(off).
  • A single component of a known expected InfinityDB class with Cu.TypeAt(int off). E.g. StringAt(off)
  • A single component of any InfinityDB class with Cu.componentAt(int off), or Cu.skipComponent(int off).
  • An Object[], of any InfinityDB primitive classes with Cu.compositeAt(int off), or Cu.skipComposite(int off)
  • An Object[], of any InfinityDB component classes  Cu.toComponentArray(int off)

In the above, an ‘InfinityDB class’ includes the primitive wrappers such as the standard Long, Double, and so on, but also String, Date, and the InfinityDB-specific classes ByteString, Index, Class, and Attribute. A ‘primitive’ in InfinityDB is any component class but Class and Attribute. ‘EntityClass’ is the InfinityDB Java class representing a conceptual class in the database. An example:

try (Cu cu = Cu.alloc()) {
    // create an example Item in cu. It could have been in an ItemSpace
    cu.append("hello").append(5).append(new Date());
    // Find the number 5. Its position is off chars from the start
    int off = cu.skipComponent(0);
    // get the 5. If it is not a long, throw an Exception
    long n = cu.longAt(off);
    // get the 5 without knowing that it is a long
    Object n = cu.componentAt(off);
    // get all of the components as an array
    Object[] components = cu.toComponentArray(0);
}

For more on the ItemSpace, see the Basic ItemSpace Example Code and see the original Manual. (Note that the component printed representation or ‘tokenized’ format of Dates has been changed to the ISO standard.) Also feel free to email support@boilerbay.com.

Abstract Discussion

Do you want to implement an ItemSpace by subclassing ItemSpace? Here are some suggestions.

Expected ItemSpace Characteristics

  • There is no pre-configuration for specific Item sizes, so any Item of any length can be inserted into or deleted from any mutable ItemSpace dynamically at any time.
  • Any ItemSpace can efficiently contain any mixture of Items including Items of any mixture of different lengths.
  • The sorted state is maintained at all times with no significant variable or unpredictable delay after mutation, either for further mutation or for retrieval.
  • Insertions and deletions may be mixed in any order  with Items in any order or mixture of lengths with no significant performance loss or variability.
  • Mutations are effectively atomic and immediate, except that uncertainties resulting from concurrency between Threads may slightly re-order the operations, in the absence of transactionality. 
  • Each mutation and each retrieval always succeeds, barring significant failures like out-of-space. (However, transactions can occasionally throw OptimisticLockConflictException).

Ideal ItemSpace Characteristics

There should be no fixed per-Item storage allocation, and ideally there is no need for configuration or tuning to adapt to particular situations such as different initially expected or dynamically changing length statistics.  ItemSpaces should operate well and correctly indefinitely, ideally with no interruptions, management, human attendance, or unlimited growing resource requirements.  Persistence is good but not required. The time for each basic operation is relatively constant, typically O(log(n)).  Items are normally stored with common prefixes compressed out for speed and space efficiency, because it is common to have many Items differing only near their ends. It is common for nearby Items – especially adjacent Items – to be mutated or retrieved close together in time, and such locality is typically taken advantage of with caches. Thread-safety and multi-core concurrency are important but not required. Mutability is valuable but not required.