PatternQuery Implementation

This page might be very interesting for software engineers and those familiar with DBMS, but some of it will be interesting for anyone.

Like SQL, PatternQuery is not Turing complete, but all queries terminate eventually, so there are no infinite loops. The execution engine uses only nested loops, without necessarily using temporary space. Unlike in relational systems, the nested loops form a tree, and the organization of the tree somewhat matches the input data, so matching is close to optimal to start with, making compilation easier. However, the compiler combines patterns for speed, sharing searching, doing ‘predicate pushdown’ and much more. Optimization is rule-based and takes only milliseconds. Therefore each REST request can afford to re-compile the query that implements its API.

Temporary Memory Use

A PatternQuery cannot run out of memory except for intentional use of temporary space determined by the query definition in these cases:

  • Request and response parameters. These are for the REST API or other queries that have external input and output. These are normally of reasonable size, and can either be communicated over the network or entered and received by the user. (A ‘VolatileItemSpace’ is allocated once for the request and once for the response).
  • Logic modes ‘and’ and ‘not’. These are for the :Zig-Zag algorithm for iterating joins, unions, intersections and differences while matching. For intersections and differences, temporary space is a set of Objects that are determined by the size of the outer set in the pair of plain symbols that participate. The outer set is frequently much smaller than the inner set, so this is not always very large – often only a few objects – even though the inner sets can be of unlimited size. The zig-zag algorithm is used for Joins and is extremely efficient, and the join processing is combined with these logic modes where possible and all computed at once. The inner plain symbol does not require extra memory.

Concurrency

Queries can overlap, do not interfere, and are not limited in degree of concurrency. The multi-core concurrency of the underlying InfinityDB BTree is fully utilized. Each query executes single-Threaded (so far!). There are no inter-query locks, or in fact any locks at all within the PatternQuery code, except that if a query is executed in the context of an Optimistic transaction, then OptimisticLockConflictExceptions can be thrown and retries may be needed, which are handled internally and invisibly. Optimistic Locks do not require special ‘version number’ fields unlike RDBMS but simply transparently ‘claim’ a set of prefixes. Once compiled, a query has an internal form that is a structure of objects called a VisitorNodeList that is single-threaded. The compiled query may not be executed concurrently by multiple Threads as it is stateful, so each thread must compile it at least initially to get its own compiled instance with its own internal VisitorNodeList.

Example Plan

Here is a printout of the execution plan of a test query. It is not necessary to print or evaluate plans to use PatternQueries, but this may be interesting for at least the Execution iterations estimate, actual iterations and execution time. The plan may be seen in the web database browser by clicking the ‘check by compiling’ button.

The symbols in the definition are all named n1 n2 etc. and the literals are Classes or Attributes, with no expressions. The nested loops are indicated by indentation. Terms in parentheses are ‘nodes’ from the pattern, and those with arrows are from the result. After a # is the node identifier, and after a > is an identifier that is the reference to the outer node that was determined by the original pattern.

This plan is efficient because it is not just a diagonal but has many ‘outdents’ where backtracking can occur, showing the tree structure of the looping. The looping is dependent on the pattern structure, and often there is much more bushiness than shown here for high efficiency.

You can see that there has been extensive editing of the pattern, because the outer node references (such as from #114 to >112) frequently hop over nodes in this plan. For example, literals have been moved outwards, which is called ‘predicate push down’, to increase efficiency. Join symbol nodes have been moved so they tend to follow immediately the symbol nodes that they join with, which allows a very efficient ‘zigzag’ algorithm that computes the intersection in minimal time. Joins tend to produce long diagonals, as do result cross products. A result cross product happens when a result includes symbols from different suffixes of multiple patterns. You can see that the cross product has nodes three-deep, while in the original pattern, there were two patterns, each two-deep.

Query plan
Compile time: 39.9102 msec
Execution time: 6.5167 msec
Unreferenced symbols: []
Definition:
query {
    pattern {
        Cross {
            =n10;
            =n11;
        }
        DoubleJoina =n4 a =n5;
        DoubleJoinb =n5 a =n6;
        DoubleJoinc =n5 a =n7;
        Join =n1 a =n2;
        Join2 =n2 a =n3;
    }
    result {
        Cross =n10 =n11;
        DoubleJoin =n4 =n5 {
            =n6;
            =n7;
        }
        DoubleJoinEnding =n6;
        Join =n1 =n2 =n3;
    }
}
Node list: 
(root #115 >)
    (Cross #112 >115)
        (=n11 #113 >112)
            (=n10 #114 >112)
                -> Cross =n10 =n11
    (DoubleJoinc #86 >115)
        (DoubleJoinb #87 >115)
            (DoubleJoina #88 >115)
                (=n4 #89 >88)
                    (a #90 >89)
                        (=n5 #91 >86)
                            (!join =n5 #92 >87)
                                (!join =n5 #93 >90)
                                    (a #94 >91)
                                        (a #95 >92)
                                            (=n7 #96 >94)
                                                (=n6 #97 >95)
                                                    -> DoubleJoinEnding =n6
                                                        -> DoubleJoin =n4 =n5 =n6
                                                            -> DoubleJoin =n4 =n5 =n7
    (Join2 #64 >115)
        (Join #65 >115)
            (=n1 #66 >65)
                (a #67 >66)
                    (=n2 #68 >64)
                        (!join =n2 #69 >67)
                            (a #70 >68)
                                (=n3 #71 >70)
                                    -> Join =n1 =n2 =n3

Execution iterations estimated by fanouts: 120.0
Execution iterations actual: 82
Plan print time: 6.2179

ZigZag Joins and Logic

The join algorithm very efficient and fast. It dynamically computes the intersection of the sets of values represented by the symbol nodes being joined, and it requires almost no intermediate storage. ZigZag takes advantage of the fact that all data is sorted at all times.

ZigZag computes the join in a maximum time that is determined by the smallest of any of the joined symbol node’s fanouts. Also, ZigZag is efficient when the join intersection is small, often taking only a few accesses, depending on the data. Even for large fanouts, much efficiency is gained by ‘skipping ahead’ and dynamically bypassing values in some of the value sets. Overall, performance increases can be orders of magnitude over sorting or filtering. The algorithm works well for multi-way or multi-order joins as well.

The algorithm does not make crosses disappear if they arise, as they are necessary for discovering all possible combinations of symbol values in the crossed parts, but the crossed parts are not part of the join symbol processing itself.

ZigZag joins are processed combined with the ‘and’ and ‘not’ logic operations which are determined by the ‘logic’ symbol attribute for very high efficiency. See the PatternQuery Reference for more on logic.

The ZigZag algorithm cycles through the inputs, doing a next() on each one. It compares the retrieved values with a ‘current’ value, and if different, the current value is set to the retrieved value. When all of the inputs have been scanned and the current value has not changed, then a match has been found and the inner VisitorNode is visited. For the ‘and’ and ‘not’ logic, further tests are done and some temporary space is needed.

The Logic operations are very useful in creating user-defined queries with the ability to select based on user input for three sets: an ‘any’, an ‘all’ and a ‘none’. These are ‘anded’ together. For example, in an AI image database, one can say ‘I want all images taken by any of these cameras or in any of these image sets, and all of them must also have all of these labels, but also none of them can be in the ‘temp’ or ‘trash’ categories.’

From, To and Equals are Fast

The restriction of a plain symbol to have a given minimum or maximum or both is very efficient using the from, from_exclusive, to, or to_exclusive symbol attributes. The iteration does not ‘throw away’ data that is matched at the beginning or end of its iterations, but always searches directly starting with a single initial access and hopping out immediately at the end (with a B-Tree, in log time). These may also be expressions which are dependent on the values of outer symbols. These are efficient even with symbols that create joins by occurring multiple times in the pattern.

Also, if the plain symbol has an ‘equals’ symbol attribute, a single direct (log-time) access is needed, even if there are also from and to symbol attributes or joins, and this may be an expression also.

Inline is Fast

An expression or literal inline in the pattern uses a single access once the iteration of outer symbols reaches them. The expression is evaluated given the values bound in the outer symbols, and then a single match on it is performed, with no filtering. This is also log-time (a single the B-Tree search). The alternative, a filter at the next level deeper, would have to iterate all of the values.

Filtering and Backtracking

There is also a very general ‘filter’ kind of symbol, but it is not as efficient, since it always works by testing on every iteration it receives from the outer symbols, and backtracking when its ‘expression’ symbol attribute evaluates to false. In general, queries do everything possible to avoid such ‘wasted’ iterations that backtrack without results, thus they avoid ‘filtering’.

Backtracking is possible to invoke explicitly, however. The ‘required’ symbol attribute causes backtracking if no matches occur at that symbol (it is possible to move forwards i.e. inwards sometimes even with no match). Also, the ‘all_must_match’ and ‘none_must_match’ symbol attributes do a ‘pre-pass’ to determine if there are any matches all the way inwards (that would cause a result to fire), backtracking if required. The pre-pass is efficient, just looking for any one result that would fire or any one result that would not fire. The pre-pass may take as long as a complete match test, though, if unlucky. These may be nested.

Fanouts

The fanout symbol attribute takes a constant number, and it allows queries to be fine-tuned and it provides execution time estimation.

The execution time of a query depends on the number of iterations that will be needed for each symbol, and this information can be defaulted or explicitly specified in the query itself via the ‘fanout’ symbol attribute. The fanout only needs to be an estimate, and the defaults are normally sufficient, but by specifying the fanout more precisely, the compiler can do fine tuning to optimize the query. Also, given fanouts, the printed query plan will include an estimate of the actual number of iterations required in advance. This plan can be seen in the database browser’s query execution page using the ‘check by compiling’ button. Knowing the total steps can show up plans that inadvertently include complex ‘crosses’ which are loops that have become nested for some reason, so the number of iterations is the product of the fanouts.

A ‘fanout’ is the number of values of an inner loop compared to its immediate outer loop, where a ‘loop’ is either a plain symbol matching data, or else a ‘series’ symbol kind, which generates data by iterating inner nodes according to a from, to, and increment. If a plain symbol is limited by a first, last, or equals symbol attribute then it can match only once and it is not a ‘loop’. Also, the limit symbol attribute will restrict the loops to a given number and skip the rest. By default, symbol nodes have fanout about 4, others have fanout about 1, but series symbol kinds are evaluated according to their from, to, and increment symbol attributes to calculate the fanout exactly if those are constant. The default fanouts are rough approximations and not specified here.

Some fanouts are relatively constant, while others are hard to predict. For example, the number of pets per household might be 0.3, while the number of zipcodes per US state might be about 100. However, the number of other fanouts might be related to the data set size, and these fanouts can be set by the user to the expected actual data size. The actual numbers are not critical so long as they represent general overall ratios. In fact the ratios are not important either – only the ordering from low to high.

Symbols with small fanouts – in particular those less than 1 – should be computed early in the execution – near the root of the ‘visitor node list’. This can reduce iterations that do not produce results, and are effectively ‘filters’. Larger fanouts do not appreciably affect optimization.

Comparison with Relational Plans

PatternQuery plans contrast with those generated by a relational system, in which the nesting of a nested loops plan is only diagonal or ‘left-deep’, determining effectively a huge logical cross product that is logically filtered at appropriate points to implement join and select. Indexes are necessary on various loops if available to prevent multiplicatively increasing execution times.

Nested loops in a relational system are therefore not always very efficient, and various other join mechanisms are preferred, but the selection and ordering of the join mechanisms is very complex. In fact, finding an optimum or even acceptable relational plan is combinatorially complex. Query optimization may take substantial time due to the difficulty of finding a reasonably efficient plan, and the execution efficiency is still often low, possibly extremely so. Every relational system implementation has a practical ‘Join order limit’ of, for example 5 to 10 tables depending on the query, beyond which compilation and execution become too slow.

PatternQueries have no substantial join count limits, and in any case many joins are not even necessary, because each case of a nesting is like a ‘pre-computed join’ or ‘materialized join’.

Relational operators other than nested loops require temporary space of some sort, such as hash joins which require large amounts of memory, or merge joins which use storage. Temporary space is a resource that must be carefully managed, so it can limit concurrency.

PatternQuery is fully concurrent, uses no resources, and provides immediate compilation, optimization, and execution launching. PatternQuery also allows ‘replication’ of data in various forms resembling relational indexes and inversions, but also custom creative structures for even better performance. The choice of such replications for joins is made by the PatternQuery author in advance, speeding compilation and providing an opportunity for execution optimization. PatternQueries are named and stored in the database for later use already optimized, in contrast to relational systems, which must usually perform the entire optimization on each query. Relational systems generally do provide some form of ‘prepared statement’ but mostly queries are cobbled up by code by creating SQL statements as strings on each execution.