PatternQuery Implementation

This page might be very interesting for software engineers and those familiar with DBMS, but some of it will be interesting for anyone.

Like SQL, PatternQuery is not Turing complete, but all queries terminate eventually, so there are no infinite loops. The execution engine uses only nested loops, without necessarily using temporary space. Unlike in relational systems, the nested loops form a tree, and the organization of the tree somewhat matches the input data, so matching is close to optimal to start with, making compilation easier. However, the compiler combines patterns for speed, sharing searching, doing ‘predicate pushdown’ and much more. Optimization is rule-based and takes only milliseconds. Therefore each REST request can afford to re-compile the query that implements its API.

Temporary Memory Use

A PatternQuery cannot run out of memory except for intentional use of temporary space determined by the query definition in these cases:

  • Sort symbols, which re-structure data within a PatternQuery (Future). These are user-defined, and can have controllable space requirements by being reset during execution at controllable times by setting the depends_on or reset symbol attributes. (A ‘VolatileItemSpace’ is allocated once per sort, starting at about 10KB).
  • Request and response parameters. These are for the REST API or other queries that have external input and output. These are normally of reasonable size, and can either be communicated over the network or entered and received by the user. (A ‘VolatileItemSpace’ is allocated once for the request and once for the response).
  • Logic modes ‘and’ and ‘not’. These are for the :Zig-Zag algorithm for iterating intersections and differences while matching. The temporary space is a set of Objects that are determined by the size of the outer set in the pair of plain symbols that participate. The outer set is frequently much smaller than the inner set, so this is not always very large – often only a few objects – even though the inner sets can be of unlimited size. The zig-zag algorithm is also used for Joins and is extremely efficient, and the join processing is combined with these logic modes where possible and all computed at once. The inner plain symbol does not require extra memory.

Concurrency

Queries can overlap, do not interfere, and are not limited in degree of concurrency. The multi-core concurrency of the underlying InfinityDB BTree is fully utilized. Each query executes single-Threaded (so far!). There are no inter-query locks, or in fact any locks at all within the PatternQuery code, except that if a query is executed in the context of an Optimistic transaction, then OptimisticLockConflictExceptions can be thrown and retries may be needed. Optimistic Locks do not require special ‘version number’ fields unlike RDBMS but simply transparently ‘claim’ a set of prefixes. Once compiled, a query has an internal form that is a structure of objects called a VisitorNodeList that is single-threaded. The query may not be executed concurrently by multiple Threads as it is stateful, so each thread must compile it at least initially to get its own compiled instance with its own internal VisitorNodeList.

Expression Evaluation

Common subexpressions are factored out and values are cached where helpful. A notification tree propagates value change events, so expressions and subexpressions are not unnecessarily re-executed when referred to. This can be important when doing regex matching and text formatting in relatively outer parts of the pattern.

Example Plan

Here is a printout of the execution plan of a test query. It is not necessary to print or evaluate plans to use PatternQueries, but this may be interesting for at least the Execution iterations estimate, actual iterations and execution time. The plan may be seen in the web database browser by clicking the ‘check by compiling’ button.

The symbols in the definition are all named n1 n2 etc. and the literals are Classes or Attributes, with no expressions. The nested loops are indicated by indentation. Terms in parentheses are ‘nodes’ from the pattern, and those with arrows are from the result. After a # is the node identifier, and after a > is an identifier that is the reference to the outer node that was determined by the original pattern.

This plan is efficient because it is not just a diagonal but has many ‘outdents’ where backtracking can occur, showing the tree structure of the looping. The looping is dependent on the pattern structure, and often there is much more bushiness than shown here for high efficiency.

You can see that there has been extensive editing of the pattern, because the outer node references frequently hop over nodes in this plan. For example, literals have been moved outwards, which is called ‘predicate push down’, to increase efficiency. Join symbol nodes have been moved so they tend to follow immediately the symbol nodes that they join with, which allows a very efficient ‘zigzag’ algorithm that computes the intersection in minimal time. Joins tend to produce long diagonals, as do result cross products. A result cross product happens when a result includes symbols from different suffixes of multiple patterns. You can see that the cross product has nodes three-deep, while in the original pattern, there were two patterns, each two-deep.

Query plan
Compile time: 39.9102 msec
Execution time: 6.5167 msec
Unreferenced symbols: []
Definition:
query {
    pattern {
        Cross {
            =n10;
            =n11;
        }
        DoubleJoina =n4 a =n5;
        DoubleJoinb =n5 a =n6;
        DoubleJoinc =n5 a =n7;
        Join =n1 a =n2;
        Join2 =n2 a =n3;
    }
    result {
        Cross =n10 =n11;
        DoubleJoin =n4 =n5 {
            =n6;
            =n7;
        }
        DoubleJoinEnding =n6;
        Join =n1 =n2 =n3;
    }
}
Node list: 
(root #115 >)
    (Cross #112 >115)
        (=n11 #113 >112)
            (=n10 #114 >112)
                -> Cross =n10 =n11
    (DoubleJoinc #86 >115)
        (DoubleJoinb #87 >115)
            (DoubleJoina #88 >115)
                (=n4 #89 >88)
                    (a #90 >89)
                        (=n5 #91 >86)
                            (!join =n5 #92 >87)
                                (!join =n5 #93 >90)
                                    (a #94 >91)
                                        (a #95 >92)
                                            (=n7 #96 >94)
                                                (=n6 #97 >95)
                                                    -> DoubleJoinEnding =n6
                                                        -> DoubleJoin =n4 =n5 =n6
                                                            -> DoubleJoin =n4 =n5 =n7
    (Join2 #64 >115)
        (Join #65 >115)
            (=n1 #66 >65)
                (a #67 >66)
                    (=n2 #68 >64)
                        (!join =n2 #69 >67)
                            (a #70 >68)
                                (=n3 #71 >70)
                                    -> Join =n1 =n2 =n3

Execution iterations estimated by fanouts: 120.0
Execution iterations actual: 82
Plan print time: 6.2179

ZigZag Joins and Logic

The join algorithm very efficient and fast. It dynamically computes the intersection of the sets of values represented by the symbol nodes being joined, and it requires almost no intermediate storage. ZigZag takes advantage of the fact that all data is sorted at all times.

ZigZag computes the join in a maximum time that is determined by the smallest of any of the joined symbol node’s fanouts. Also, ZigZag is efficient when the join intersection is small, often taking only a few accesses, depending on the data. Even for large fanouts, much efficiency is gained by ‘skipping ahead’ and dynamically bypassing values in some of the value sets. Overall, performance increases can be orders of magnitude over sorting or filtering. The algorithm works well for multi-way or multi-order joins as well.

The algorithm does not make crosses disappear if they arise, as they are necessary for discovering all possible combinations of symbol values in the crossed parts, but the crossed parts are not part of the join symbol processing itself.

ZigZag joins are processed combined with the ‘and’ and ‘not’ logic operations which are determined by the ‘logic’ symbol attribute for very high efficiency. See the PatternQuery Reference for more on logic.

The ZigZag algorithm cycles through the inputs, doing a next() on each one. It compares the retrieved values with a ‘current’ value, and if different, the current value is set to the retrieved value. When all of the inputs have been scanned and the current value has not changed, then a match has been found and the inner VisitorNode is visited. For the ‘and’ and ‘not’ logic, further tests are done and some temporary space is needed.

The Logic operations are very useful in creating user-defined queries with the ability to select based on user input for three sets: an ‘any’, an ‘all’ and a ‘none’. These are ‘anded’ together. For example, in an AI image database, one can say ‘I want all images taken by any of these cameras or in any of these image sets, and all of them must also have all of these labels, but also none of them can be in the ‘temp’ or ‘trash’ categories.’

From and To are Fast

The restriction of a plain symbol to have a given minimum or maximum or both is very efficient using the from, from_exclusive, to, or to_exclusive symbol attributes. The iteration does not ‘throw away’ data that is matched at the beginning or end of its iterations, but always searches directly starting with a single initial access and hopping out immediately at the end (with a B-Tree, in log time). These may also be expressions which are dependent on the values of outer symbols. Also, if the plain symbol has an ‘equals’ symbol attribute, a single direct (log-time) access is needed, even if there are also from and to symbol attributes, and this may be an expression also. This is similar to an expression or literal in the pattern, which uses a single access once the iteration of outer symbols reaches them.

There is also a very general ‘filter’ kind of symbol, but it is not as efficient, since it always works by testing on every iteration it receives from the outer symbols, and backtracking when its expression symbol attribute evaluates to false. In general, queries do everything possible to avoid such ‘wasted’ iterations that backtrack without results, thus they avoid ‘filtering’.

Fanouts

The fanout symbol attribute takes a constant number, and it allows queries to be fine-tuned and it provides execution time estimation.

The execution time of a query depends on the number of iterations that will be needed for each symbol, and this information can be defaulted or explicitly specified in the query itself via the ‘fanout’ symbol attribute. The fanout only needs to be an estimate, and the defaults are normally sufficient, but by specifying the fanout more precisely, the compiler can do fine tuning to optimize the query. Also, given fanouts, the printed query plan will include an estimate of the actual number of iterations required in advance. This plan can be seen in the database browser’s query execution page using the ‘check by compiling’ button. Knowing the total steps can show up plans that inadvertently include complex ‘crosses’ which are loops that have become nested for some reason, so the number of iterations is the product of the fanouts.

A ‘fanout’ is the number of values of an inner loop compared to its immediate outer loop, where a ‘loop’ is either a plain symbol matching data, or else a ‘series’ symbol kind, which generates data by iterating inner nodes according to a from, to, and increment. If a plain symbol is limited by a first, last, or equals symbol attribute then it can match only once and it is not a ‘loop’. Also, the limit symbol attribute will restrict the loops. By default, symbol nodes have fanout 2, others have fanout 1, but series symbol kinds are evaluated according to their from, to, and increment symbol attributes to calculate the fanout exactly if those are constant.

Some fanouts are relatively constant, while others are hard to predict. For example, the number of pets per household might be 0.3, while the number of zipcodes per US state might be about 100. However, the number of other fanouts might be related to the data set size, and these fanouts can be set by the user to the expected actual data size. The actual numbers are not critical so long as they represent general overall ratios.

Symbols with small fanouts – in particular those less than 1 – should be computed early in the execution – near the root of the ‘visitor node list’. This can reduce iterations that do not produce results, and are effectively ‘filters’. Larger fanouts do not appreciably affect optimization.

Comparison with Relational Plans

PatternQuery plans contrast with those generated by a relational system, in which the nesting of a nested loops plan is only diagonal, determining effectively a huge logical cross product that is logically filtered at appropriate points to implement join and select. Nested loops in a relational system are therefore not always very efficient, and various other join mechanisms are often used, but the selection and ordering of the join mechanisms is very complex. In fact, finding an optimum or even acceptable relational plan is combinatorially complex. Query optimization may take substantial time due to the difficulty of finding a reasonably efficient plan, and the execution efficiency is still often low, possibly extremely so. Relational operators other than nested loops require temporary space of some sort, such as hash joins which require large amounts of memory, or merge joins which use storage. Temporary space is a resource that must be carefully managed, so it can limit concurrency. PatternQuery is fully concurrent, uses no resources, and provides immediate compilation, optimization, and execution launching.