InfinityDB Formally

Here we attempt to formalize part of InfinityDB a bit. For less formality see the full description in The ItemSpace Data Model and for the details of PatternQuery see PatternQuery Reference, PatternQuery Examples, and PatternQuery Perspective. Also the simple i language.

Components

The domain of discourse is the set of possible components C, which is the set of tuples (t, v), t in the set of 12 data types T, v in the possible values of the type t. The types are: EntityClass, Attribute, string(0..1024), boolean, float(32bits), double(64bits), long(64bits), date, byte array(0..1024), byte string(0..1024), char array(0..1024), and index(64 bits). The ‘Token’ representation of EntityClasses match [A-Z][a-zA-Z0-9-_.]* Attributes match [a-z][a-zA-Z0-9-_.]* and see The ItemSpace Data Model for the rest. Components are comparable by magnitude, no matter their data types, according to the fixed ordering of the data types above. The ordering is determined by t, and if equal, then by v. In practice, components are atomic and immutable, although their types and values and other things can be determined in expressions.

Items

Items are described here as tuples of components of variable lengths i.e. {c[n]}, c[n] in C, for all n >-=0, n < item_length or (c[0],c[1]..c[n]). The positions of the components in the tuples are zero-based. This idea of tuples comes from the relational algebra of DBMS, in which tuples may not contain other tuples, and the elements are all from the data type domain C. Also, we number the elements, unlike relational algebra, which formally just associates one unique element of a ‘heading’ set with each tuple element. There is no idea of a Null, undefined or empty element, so every n >=0, n < item_length has a corresponding c[n] in C. The empty Item () is allowed. Expressions may evaluate to ‘undefined’ in some cases, but undefined cannot occur in Items, so trying to construct an Item (x,y) where x or y is undefined evaluates to undefined.

The limit on persistent item length is actually determined by the space taken by encoding the components of an Item, with a maximum of 1665 chars, but in expression evaluation they can be very long.

We call Items tuples here for formality, but elsewhere we define an Item to be divided into a series of tuples punctuated by the two ‘meta components’ EntityClass and Attribute which are ‘non-primitive’ so the other 10 types are ‘primitive’.

You could call an Item a list but it is immutable, and changing any part of it would transform it into a different Item. It is like a number. There is no nesting of Items in Items, unlike, say tuples or lists in Python. Nesting is legal in expressions such as (w,(x,y),z) but this evaluates to the Item (w,x,y,z) and ((x)) = (x) = x, where w,x,y,z are components. In expressions, there are functions x.get(index), x.get(start,after_end), x.head(after_end), and x.tail(start), with negative indexes being from the end, returning Items. Some operators work by matching the indexes implicitly, so (1,2) + (3,4) = (4,6), and 5 * (1, 2, 3) = (5, 10,15). Note that Items are also not vectors, which normally would have a single data type for all elements, unless you consider a vector of C.

Singleton Items

  • An Item of length 1 is a ‘singleton’. It is indistinguishable from an isolated component. Thus there is no way to convert between singletons and isolated components since they are identical. So (c) = c, c in C. Also c.get(0) = c. (Historically, a singleton can optionally be written (x,) but this may be misleading).
  • An Item of length > 1 is ‘composite’ and is not a singleton.
  • The Item of length 0 is the ’empty’ Item and is not a singleton.

Item Ordering

  • Items are equal if they are equal tuples;
  • otherwise item1 < item2 if item1 is a prefix of item2, i.e. length(item1) < length(item2) and for all n item1[n] = item2[n], n >= 0, n < length(item1). We also say item2 is an extension of item1;
  • otherwise, item1 < item2 if there exists an m such that for all n item1[n] = item2[n], and item1[m] < item2[m], n >= 0, n < m.

ItemSpace

An ItemSpace is a totally ordered subset of the set of tuples C^n, n >= 0, n < about_30. Thus it is a set of Items totally ordered by the Items. Databases are persistent ItemSpaces, and there are many other contexts for them. An ItemSpace can be thought of as mapping to a single large document such as extended JSON, but scaling smoothly over 13 orders of magnitude or more.

Subspace

An ItemSpace s1 is a subspace of an ItemSpace s2 on a prefix p if it is a maximal set of Items such that for each i1 in s1, the item (p,i1) is in s2.

Complex ‘Fractal’ Structures

Subspaces are logically but not physically nested to create a hierarchy like ‘fractals’. Also, the EntityClass and Attribute meta component types are mixed into the Items to describe structures. Adjacent components within an Item of the form (meta, tuple, meta, tuple) where a tuple is zero or more primitives are sometimes informally described this way:

  • (EntityClass, entity, Attribute, value) is called a ‘table’;
  • (EntityClass, entity, EntityClass, entity) is called a ‘sub-table’;
  • (Attribute, value, Attribute, value) is called a ‘sub-attribute’; and
  • (Attribute, value, EntityClass, entity) is called a ‘nested table’.

A component more towards the end of an Item is ‘inner’ and one more towards the start is ‘outer’. (You might have expected the reverse. Imagine a tree of nested loops, where there are outer and inner loops.)

Basic Operations

The minimal ItemSpace operations run in log(totalItemCount) time. There are mutator methods on class ItemOutput which is the superclass of ItemSpace:

  • insert(item) put item into the ItemSpace
  • delete(item) remove item from the ItemSpace
  • update(item, prefixLength, includePrefix) is like deleteSubspace(item[0:prefixLength], includePrefix), insert(item)
  • deleteSubspace(item, includePrefix) is like delete(item2) where item = item2[0:length(item)], and iff includePrefix then delete(item)

The minimal retrievals on class ItemSpace pass parameter item by reference and modify only item[prefixLength:]. They return true if item is actually modified and also return true for the equals cases if item exists:

  • first(item, prefixLength) change item to nearest existing item >= item.
  • next(item, prefixLength) change item to nearest existing item > item
  • last(item, prefixLength) change item to nearest existing item <= item
  • previous(item, prefixLength) change item to nearest existing item < item.

PatternQuery

A PatternQuery contains a where, a pattern, and a result. It is represented by the contents of an ItemSpace, often a subspace of another, in which its definition is the subspace on the prefix query, an Attribute component type. Thus PatternQueries can exist in a database; for example they can be with the data they apply to. To be executable, they must be stored in the database directly on a prefix (Query,interface,method) where Query is a constant EntityClass component, and interface and method are strings. Being ‘direct’ means being in an outermost subspace. The interface is one or more dot-separated components, where each component is one or more lower case letters, digits, or dashes. This is a close superset of valid internet domain names.

Where

The where is a subspace of the query definition subspace on the prefix Where, an EntityClass component type.

The where is equivalent to a relation (symbol id, symbol_attribute,value), where a symbol id is a normal string programming language identifier, a mixture of letters, digits, and underscore not starting with a digit, and a symbol_attribute is an element of a fixed set of symbol attributes such as kind, type, and more. Some symbol attributes require that the value be a function of the symbol id, some require each value to be a singleton, some restrict the types of each value to a certain set, while some allow the value to be a constant or expression or any combination of such restrictions. The symbol attributes determine the semantics of the symbols within the pattern and result. For details see PatternQuery Reference.

Symbol Kinds

The optional kind symbol attribute determines the characteristics of the symbol and the meaning of an allowed subset of the predefined symbol attributes. The symbol attributes each have similar semantics across all kinds. The categories of kinds are:

  • the default i.e. ‘plain’ kind, which participates in matching and is a ‘loop’. Others are ‘special’ kinds;
  • the series kind which generates integers and is also a ‘loop’;
  • ‘stateful’ kinds including reducers, hashes, randoms, timers, counters, summers, statistics, and more, which compute over multiple Items according to the matching;
  • ‘root’ kinds to select a subset of the set of predefined input and output ItemSpaces; and
    • no-op, filter and echo kinds.

Pattern

The pattern tree is a subspace of the query definition on the prefix pattern, an Attribute component type, where the string components have additional semantics:

  • an initial equals followed immediately by a symbol id is a symbol reference
  • an initial equals followed by a java-like expression other than an immediate single symbol id is a pattern expression. It contains operators and symbol ids.
  • an initial group of at least two equals is a representation of a literal string with one less initial equals.
  • otherwise it is a literal string

A pattern is one Item in the pattern tree. The pattern space is the ‘input’ ItemSpace, which is normally the database or a subspace of the database, but can include other ‘root’ spaces, like the request content, the request header, the query definition space itself, a local ‘WithData’ subspace associated with the query, and a request url query string parameter. A root space is indicated by a symbol at the start of a pattern or result that is one of the root symbol kinds.

Result

The result tree is a subspace of the query definition on the prefix result, an Attribute component type. The strings in it have the same rules as for pattern. A result is one Item in the result tree. The result space is the ‘output’ ItemSpace, which is normally the database or a subspace of it, but may also include other ‘root’ ItemSpaces such as the response content or response header.

Matching

Matching is the process by which the executor finds Items in the input ItemSpace that satisfy the constraints determined by the pattern and the Where. A partial match occurs when for each p[n] in the pattern p and the corresponding component i[n] in an input item i (the position n skips special kind components within p):

  • p[n] is a plain symbol reference and the Where constraints are satisfied given i[n], in which case its value is bound to i[n]
  • p[n] is a literal and it is equal to i[n]
  • p[n] is an expression, and the value that it evaluates to equals i[n] based on the bindings of the symbols it references

A complete match occurs for every combination of input items that match all of the patterns. An additional constraint is that multiple symbol references of the same plain symbol other than in an expression must match the same value: this is a ‘join’. Only the plain symbol kind can repeat in the pattern tree. There are many symbol attributes that limit the matching further.

Firing Results

When a complete match occurs, results fire. Each result fires by creating an Item based on the literals, expressions, and symbols it contains and applying it to the result space, given the particular bound symbol values. Results may have various effects on the result space based on the action symbol attribute of their terminal symbol reference. If a result starts with a root symbol, the associated space is the destination. The results are independent, meaning that executing the query q is the equivalent of: for each result r in query q, execute q’ where q’ is q reduced to having only r.

Feedback and Steps

Usually at least one result space is the same as a pattern space, so it is possible for a result’s action to affect the matching, which is called ‘feedback’. Feedback is data dependent, and the results are not independent. To control feedback, a step symbol attribute can specify a set of constant integers, and the query is executed repeatedly in phases i.e. steps by increasing step number. The symbols with the step attribute occur at the ends of the appropriate results to identify them.(The query is pruned on each step to minimize the work.) Steps are very useful beyond resolving feedback.

Cross Products

The matching is logically a complete cross product of all Items and would sometimes take too long to compute practically, so the compiler prunes the search as much as possible. This may inherently not be an issue: for example, often there is a direct relationship between a pattern and a result and the matching only looks at one input Item at a time and for each match it fires the result at most once.

When writing PatternQueries, the author must consider whether the cross product may be too large, but normally the appropriate joining and other constraints will prevent that. The query compiler can print a plan showing an estimated number of matches, and the author can provide explicit fanout estimates to make this more precise if desired. Symbol attributes can limit the matching in various ways; for example, these can help: type, first, last, limit, values, max_values, min_values, equals, from, from_exclusive, to, to_exclusive, required, all_must_match, none_must_match, and enable. Also filter kind symbols can be used.

Execution

A useful concept is that cross products actually occur according to disjoint ‘join sets’ rather than over the entire set of patterns . A join set is a maximal subset of patterns connected directly or indirectly by symbol references. Thus a join set is:

‘join set’ = a component of the transitive closure of the symmetric binary relation (p1, p2) where s1 is in p1 and s2 is in {p2, r, an expression e1 in p2, an expression e2 in r} where p1, p2 are patterns, s1,s2 are symbols, and r is a result.

Join sets partition the patterns. For each pattern in a join set, logically every possible Item in the pattern space is tried to find a complete match. In practice fewer are tried. The compiler converts each join set into a set of nested loops, each loop containing only one inner loop. Each result is associated with one such nested set, but some nested sets handle multiple results.

The compiler combines nested sets for the results on common prefixes, resulting in a tree, called the VisitorNodeList. The compiler optimizes the depth-wise ordering of the of the loops in the nested sets to maximize common prefixes and to maximize the efficiency of matching. There is some optimization similar to ‘predicate push-down’ in RDBMS. The depth-wise ordering can be aided by hints from the author called ‘fanout’ symbol attributes.

Joins for Free

In RDBMS, a one-to-many relationship is created by foreign key references between multiple tables, but in InfinityDB, some one-to-many relationships are implied by the nesting of subspaces. When symbols are nested, there is no extra work in ‘locating the many given the one’. More nesting compounds this speedup. In RDBMS on the other hand, indexes may be used to speed up access in selected one-to-many relationships but they normally ‘nest only once’ because each index encodes a single such relationship. Furthermore, each RDBMS index access requires yet another random access step to reach a base table row. Furthermore, InfinityDB symbols having a common prefix, i.e. being in a common superspace, can be joined more easily – sometimes ‘for free’.

An alternative in RDBMS for joins and intersections, such as hash join, can be far slower, because it must read all rows of the large table for each hash buffer extracted from the small table, where the small table is partitioned into a minimal set of hash buffers. Thus this is actually a full cross product, but it is linearly faster in the size of the hash buffers, so the performance increases as the hash buffer grows. When the entire small table fits in a hash buffer, then still the entire large table must be scanned once.

Zig-Zag Joins And Logic

For speed and semantics, a special ‘zig-zag’ algorithm is available that performs all joins very efficiently, and which can be used also for very efficient intersections and differences over any pair of adjacent plain symbols. The optional ‘logic’ symbol attribute can be put on the outer plain symbol and can have a value in {‘or’, ‘and’, ‘not’}. The ‘or’ is default. Below, match(v2) means that v2 is allowed to have that value in its symbol. The tuple (v1, v2) means that those values are in the input, given that all matching criteria with respect to their prefix are already applied, as well as the join restriction.

AND Logic: for each v2: iff for all v1: (v1, v2) exists then match(v2). 

NOT logic: for each v2: iff for all v1: (v1, v2) does not exist then match(v2).

OR logic: for all v2: if for any v1: (v1, v2) exists then match (v2). 

The ‘not’ logic is apparently self-defeating, but in case v2 is part of a join it is very useful as described below in combining logic.

Examples of logic are in the demo/readonly database and PatternQuery Examples. There are there a few ‘allowed-required-excluded’ type queries. These allow an input set of criteria in the request content to control the selection of certain things in the database. The criteria are grouped into ‘allowed’, ‘required’ and ‘excluded’ sets. This is an intuitive way to get an effect similar to but not as general as user definition of boolean functions.

Combining Logic

To combine logic with symbols s1, s2, s3 and j (for a join symbol) with prefix, being an arbitrary prefix and rest1, rest2 rest3 being arbitrary inner patterns one can say:

Where { 's1' logic 'and'; 's2' logic 'not'; 's3' logic 'or'; }
pattern prefix { =s1  =j rest1; =s2 =j rest2; =s3 =j rest3; }

Then the set of values vj’ of symbol j are restricted to:

vj’ = {vj}, where for all v1 in s1 (v1,vj) exists, for all v2 in s2 (v2,vj) does not exist, for any v3 in s3, (v3, vj) exists.

In the above, the OR logic does have a semantic effect. We also assume the effects of the joining on j are already taken into account, so {vj} is already restricted to have equal values in all of the occurrences of its symbol j.

Zig-Zag Algorithm Efficiency

The zigzag is extremely efficient, and it computes a join in at most the number of elements in the smallest input set times the number of input sets, where each input set is determined by one of the join symbol occurrences. At minimum, it takes only one test of each input set, when there is no match. The join is computed repeatedly, each time a prefix is matched in the recursive visiting. This may be a speedup of many orders of magnitude. A join is a form of intersection by definition.

For AND logic i.e. intersections, the algorithm is similarly fast except that the number of input sets is the number of values in the immediately outer symbol at that point in the recursive visiting of the pattern tree. An intersection can be computed combined with a join of the immediately inner symbol.

NOT logic i.e. difference is a kind of ‘filter’ applied during the computation of the join and intersections and is not as efficient as join or intersection but is very effective when combined with them.

These can be nested within a query and can speed up the entire query multiplicatively.

Temporary memory is used for intersection and difference proportional to the fanout of the outer symbol. This is freed every time the recursive matching returns from evaluating the inner pattern. This may be many orders of magnitude smaller than the total size of the input sets. The temporary memory is not dependent on the total number of values ever visited by the outer symbol but is reduced by the matching already performed on the prefix of the symbol.

All_must_match

The symbol attributes all_must_match and none_must_match restrict the matching of a symbol depending on the matching of the inner parts of the pattern. Thus it is a form of lookahead. Each value of the symbol is tested by binding it to the symbol, and then attempting to match the inner patterns. The testing may terminate early on a failure to match (or not match) on the inner patterns.

Where 's' all_must_match true;
pattern outer =s inner;

The above means:

for each Item i0 matching outer: if (for all v in s: there exists an i1 matching inner) then all v in s match

There is a silent pre-pass in order to recurse into the inner patterns and then a backtracking to the symbol, followed by a conditional regular match inwards. These can be nested and mixed, and the pre-pass occurs only for the outermost symbol, with the inner nested ones handled within the same pre-pass. The pre-pass may be much faster than the regular matching because it can quit early since it only tries to find whether there is a non-match (or a match) for every possible binding of the symbol’s value.

Symbol Adjacencies

In general, in compiling for a given result, we are trying to rewrite a tree of pattern loops to flatten out all of the branches into a single branch of deeply nested loops so we can test all possible combinations of matching Items. However, in certain situations this is difficult. We also want to apply depth-wise re-ordering to that single-branch of nested loops for performance. The stateful symbol kinds and the symbol attributes all_must_match and none_must_match present limitations on the depth-wise re-ordering of the symbols in the single-branch. Also the reset symbol attribute has an effect. At times we want the structure to continue to act like the original branching pattern tree, but at times we want the single branch behavior. The two are combined into a single-branch tree by observing ‘symbol adjacencies’. Such adjacencies force some regions of the single-branch to be in the order of a depth-first recursion of the pattern tree with no interpositions.

Stateful symbols do things like count, sum, reduce and so on that depend on the number of iterations they see during the recursive visiting. The reset symbol attribute controls the ‘scope’ of the stateful symbols, widening their scope when set false. The states are reset to the value of the from symbol attribute on each match of the nearest outer plain symbol or series symbol having the reset symbol attribute true. Within the scope, the symbols must be kept adjacent.