PatternQuery Implementation

This page is really for software engineers and those familiar with DBMS, but perhaps a scan of it will be interesting for anyone.

Like SQL, PatternQuery is not Turing complete, but all queries terminate eventually, so there are no infinite loops. A PatternQuery cannot run out of memory except for intentional use of temporary space, such as the temporary root map in a Stepped query, or the response header or response content in a REST client query . The execution engine uses only nested loops, without temporary space, but a stepped query can explicitly use temporary space or modify the database and then input the modifications. Unlike relational, the nested loops form a tree. Optimization is rule-based and almost immediate in all cases. Compilation takes only 10’s of microseconds.

Concurrency

Queries can overlap, do not interfere, and are not limited in degree of concurrency. The multi-core concurrency of the underlying InfinityDB BTree is fully utilized. Each query executes single-Threaded (so far!). There are no inter-query locks, or in fact any locks at all within the PatternQuery code, except that if a query is executed in the context of an Optimistic transaction, then OptimisticLockConflictExceptions can be thrown and retries may be needed, as with any other code. Once compiled, a query has an internal form that is a structure of objects called a VisitorNodeList. The query may not be executed concurrently by multiple Threads as it is stateful, so each thread must re-compile to get its own compiled instance with its own internal VisitorNodeList.

Expression Evaluation

Common subexpressions are factored out and values are cached where helpful. A notification tree propagates value change events, so expressions and subexpressions are not unnecessarily re-executed when referred to. This can be important when doing regex matching and text formatting in relatively outer parts of the pattern.

Example Plan

Here is a printout of the execution plan of a particular test query called a VisitorNodeList. The symbols in the test definition are all named n1 n2 etc. and the literals are Classes or Attributes, with no expressions. The nested loops are indicated by indentation. Terms in parentheses are ‘nodes’ from the pattern, and those with asterisks are from the result. After a # is the node identifier, and after a > is an identifier that is the reference to the outer node that was determined by the original pattern.

This plan is efficient because it is not just a diagonal but has many ‘outdents’ where backtracking can occur, showing the tree structure of the looping. The looping is dependent on the pattern structure, and often there is much more bushiness than shown here for high efficiency.

You can see that there has been extensive editing of the pattern, because the outer node references frequently hop over nodes in this plan. For example, literals have been moved outwards, which is called ‘predicate push down’, to increase efficiency. There are ‘dup’ symbol nodes, which are related to joins. Dup symbol nodes have been moved so they tend to follow immediately the symbol nodes that they join with, which allows a very efficient ‘zigzag’ algorithm that computes the intersection in minimal time. Joins tend to produce long diagonals, as do result cross products. A result cross product happens when a result includes symbols from different suffixes of multiple patterns. You can see that the cross product has nodes three-deep, while in the original pattern, there were two patterns, each two-deep.

(root #112 >)
    (DoubleJoina #83 >112)
        (=n4 #84 >83)
            (a #85 >84)
                (DoubleJoinb #86 >112)
                    (DoubleJoinc #87 >112)
                        (=n5 #88 >85)
                            (dup =n5 #89 >86)
                                (dup =n5 #90 >87)
                                    (a #91 >90)
                                        (=n7 #92 >91)
                                            (a #93 >89)
                                                (=n6 #94 >93)
                                                    * DoubleJoinEnding =n6
                                                        * DoubleJoin =n4 =n5 =n6
                                                            * DoubleJoin =n4 =n5 =n7
    (Cross #109 >112)
        (=n11 #110 >109)
            (=n10 #111 >109)
                * Cross =n10 =n11
    (Join #62 >112)
        (=n1 #63 >62)
            (a #64 >63)
                (=n2 #65 >64)
                    (dup =n2 #66 >62)
                        (a #67 >66)
                            (=n3 #68 >67)
                                * Join =n1 =n2 =n3

This was compiled from this test query showing some very advanced cases:

query {
    pattern {
        Join =n1 a =n2;
        Join =n2 a =n3;
        DoubleJoina =n4 a =n5;
        DoubleJoinb =n5 a =n6;
        DoubleJoinc =n5 a =n7;
        Cross =n10;
        Cross =n11;
     }
     result {
         Join =n1 =n2 =n3;
         DoubleJoin =n4 =n5 =n6;
         DoubleJoin =n4 =n5 =n7;
         DoubleJoinEnding =n6;
         Cross =n10 =n11;
      }
 }

ZigZag Joins

The join algorithm is unlike relational joins. It dynamically computes the intersection of the sets of values represented by the symbol nodes being joined, and it requires no intermediate storage. ZigZag takes advantage of the fact that all data is sorted at all times.

A slower alternative join algorithm would iterate all of the values of the outer symbol node, and then filter according to the values matching the dup symbol nodes in inner nodes. The filtering requires one constant-time access for each inner symbol node for each outer value. The time for that algorithm is proportional to the size of the outer symbols’ value set. That size is not known at compile time because it is data dependent, and it is equally likely that the outer symbol’s set will be larger or smaller than the inner symbol node or symbol nodes’ sets.

ZigZag, however computes the intersection in a maximum time that is the smaller of any of the joined symbol node’s value sets. Also, ZigZag is efficient when the intersection is small, often taking only a few accesses, depending on the data. Even for large sets, much efficiency is gained by ‘skipping ahead’ and dynamically bypassing values in some of the sets. Overall, performance increases can be orders of magnitude.

Relational Plans

PatternQuery plans contrast with those generated by a relational system, in which the nesting of a nested loops plan is only diagonal, determining effectively a huge logical cross product that is logically filtered at appropriate points to implement join and select. Nested loops in a relational system are therefore not always very efficient, and various other join mechanisms are often used, but the selection and ordering of the join mechanisms is very complex. In fact, finding an optimum or even acceptable relational plan is combinatorially complex. Query optimization may take substantial time due to the difficulty of finding a reasonably efficient plan, and the execution efficiency is still often low, possibly extremely so. Relational operators other than nested loops require temporary space of some sort, such as hash joins which require large amounts of memory, or merge joins which use storage. Temporary space is a resource that must be carefully managed, so it can limit concurrency. PatternQuery is fully concurrent, uses no resources, and provides immediate compilation, optimization, and execution launching.