PatternQuery Examples

Fish Farm

Here is a typical kind of use, for a database of fish farms and the fish prices over time. Notice the join on =fish_species, which relates the farm to the prices over time: the =fish_species symbol occurs both in the fish_species attribute of the FishFarm class and in the species attribute of the FishMarket class. The result is just rows of data under the FishFarmProfit class that could be considered a series of tuples, as in a CSV file or spreadsheet. The result doesn’t have to be a single ‘CSV’ set of rows but can be arbitrarily complex, even creating multiple ‘outputs’ or structures. The =profit symbol combines the market price, the cost of the fish to each farm, and the number of fish of that species at the farm. This query might be kept available in the database itself for future re-use, as a set of Items just like any other data, say under Query "fish farm profits" query ....

This query matches a symbol we don’t need – location – but that allows us to change the result easily later without adding or removing symbols, and without affecting the query or its performance. Unused symbols in the pattern act as ‘wildcards’: they either have no effect or serve as meaningful placeholders.

query {
    Where 'profit' equals ((price - fish_cost) * fish_count);
    pattern {
        FishFarm =farm_name {
            fish_cost =fish_cost;
            fish_count =fish_count;
            fish_species =fish_species;
            location =location;
        }
        FishMarket date =market_date {
            price =price;
            species =fish_species;
        }
    }
    result FishFarmProfit =farm_name =market_date =profit;
}

This query will add any new data to the FishFarmProfit ‘table’ each time it is run, rather than clearing it first. That is not a problem unless existing data has been modified or removed, in which case stale rows would remain. To clear the entire table first, we would add a result FishFarmProfit =all with Where 'all' action 'clear'. That would run in an earlier ‘clear’ execution pass to empty the table, after which the main pass fills in the result.

Here is the data in the database:

{
    FishFarm {
        'santa cruz fish' {
            fish_cost 2.18;
            fish_count 30000;
            fish_species 'tilapia';
            location 'soquel CA';
        }
        'soquel fish' {
            fish_cost 5.03;
            fish_count 5000;
            fish_species 'tilapia';
            location 'soquel CA';
        }
    }
    FishMarket date {
        2021-10T-07:00 {
            price 10.01;
            species 'tilapia';
        }
        2021-10-2T-07:00 {
            price 10.5;
            species 'tilapia';
        }
    }
}

The profitability of the various fish farms is below. This is displayed in i code form, so the farm name is factored out as a common prefix. Dates are by default always in the ISO date format, although the time part is omitted when it is 00:00:00.

FishFarmProfit {
    'santa cruz fish' {
        2021-10T-07:00 234900.0;
        2021-10-2T-07:00 249600.0;
    }
    'soquel fish' {
        2021-10T-07:00 24899.999999999996;
        2021-10-2T-07:00 27350.0;
    }
}
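The profit arithmetic can be sanity-checked outside the database. Here is a small Python sketch (the dict layout and the simplified ISO date strings are illustrative assumptions, not InfinityDB structures) that performs the same species join and computes (price - fish_cost) * fish_count:

```python
# Illustrative mirror of the FishFarmProfit query: join farms to market
# prices on the shared species, then compute the profit per farm per date.
farms = {
    'santa cruz fish': {'fish_cost': 2.18, 'fish_count': 30000, 'fish_species': 'tilapia'},
    'soquel fish':     {'fish_cost': 5.03, 'fish_count': 5000,  'fish_species': 'tilapia'},
}
market = {  # (date, species) -> price
    ('2021-10-01', 'tilapia'): 10.01,
    ('2021-10-02', 'tilapia'): 10.5,
}

def fish_farm_profit(farms, market):
    """Join on species (the =fish_species symbol) and compute profit rows."""
    result = {}
    for farm_name, farm in farms.items():
        for (date, species), price in market.items():
            if species == farm['fish_species']:
                result[(farm_name, date)] = (price - farm['fish_cost']) * farm['fish_count']
    return result

profits = fish_farm_profit(farms, market)
# profits[('santa cruz fish', '2021-10-01')] is approximately 234900.0,
# matching the table above (up to float rounding).
```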

Downloading Data

Here is the same result as Items, which looks a lot like CSV, but is space delimited. In the server’s backend database browser, there is a ‘Show Items’ view mode that shows this, and the FishFarmProfit prefix can be factored out there automatically by selecting it as the ‘current prefix’. Then you can copy and paste this, or part of it, into Excel and convert text to columns using a space delimiter.

You can apply the simple_date_format(‘MM/dd/yy HH:mm:ss ZZ’) operator to a string to convert it to a date, or the reverse. Excel has a function to convert dates to ISO, =TEXT(A1,”yyyy-mm-ddThh:MM:ss”), and a =DATE(…) function for the reverse. Unfortunately, Excel dates do not have time zones and are only local, not points in time. InfinityDB dates are internally standard longs representing milliseconds since the epoch. In order to have the query output dates for Excel, use the date formatting function with the appropriate format for your locale: just put (market_date.simple_date_format(‘MM/dd/yy HH:mm:ss ZZ’)) in the result instead of =market_date, and you get the following (though you may want to delete the ZZ first):

"santa cruz fish" "10/01/21 00:00:00 -0700" 234900.0
"santa cruz fish" "10/02/21 00:00:00 -0700" 249600.0
"soquel fish" "10/01/21 00:00:00 -0700" 24899.999999999996
"soquel fish" "10/02/21 00:00:00 -0700" 27350.0
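For reference, the same conversion can be sketched in Python, assuming the strftime pattern ‘%m/%d/%y %H:%M:%S %z’ as the rough equivalent of the Java-style ‘MM/dd/yy HH:mm:ss ZZ’ (Java pattern letters and Python format codes differ):

```python
from datetime import datetime, timezone, timedelta

# A point in time at midnight in the -07:00 zone, like the dates above.
pdt = timezone(timedelta(hours=-7))
d = datetime(2021, 10, 1, 0, 0, 0, tzinfo=pdt)

# Excel-friendly local format, analogous to simple_date_format('MM/dd/yy HH:mm:ss ZZ'):
excel_style = d.strftime('%m/%d/%y %H:%M:%S %z')   # '10/01/21 00:00:00 -0700'

# And back toward ISO without the zone, like Excel's =TEXT(A1,"yyyy-mm-ddThh:MM:ss"):
iso_style = d.strftime('%Y-%m-%dT%H:%M:%S')        # '2021-10-01T00:00:00'
```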

Adding Structure to Fish Farm

Data can be added in many ways without affecting the fish farm query. For example, Attributes can be added to the FishFarm class, such as food_usage on all or some of the farms. The entire FishFarm class does not need to be altered to change the schema. It is easy to add data structures beneath existing data, such as a fish subspecies plugged in under fish_species – you just store fish_species 'tilapia' 'some subspecies';. The further data can be any kind of structure, without affecting the query, because =fish_species matches isolated ‘primitive’ data types by default, which ignores nested data. A query to capture all nested data as well would use Where 'fish_species' type 'item'; where an item is zero or more of any data types. Or, to capture just the species and subspecies, use two symbols together: FishFarm =farm_name fish_species =fish_species =fish_subspecies;. Another possibility is to allow the species to expand in the future as a tuple, with Where 'fish_species' type 'tuple';. Then a fish species is a tuple, meaning any series of zero or more primitives delimited by any ‘meta’ components inner to it.

CSV mode

There is a ‘Show CSV’ view mode in the backend database browser which is compatible with Excel and other standard data sources. In this case, the lines are prefixed by increasing Index data types, so that one has a list. If you select Show CSV first, then paste in some CSV and store, these Index numbers are generated automatically, and the result is something like a ‘document’ so the line order is preserved. Then this can be parsed using a custom query into anything needed, including doing the date reformatting. These documents are shown as lists in the other display modes or formats, such as JSON or i Code. (Note that storing such a CSV document on a given prefix in the Show CSV view will wipe out anything on that prefix, as usual, but that other previous content will not be visible in CSV mode, which only shows the Index-delimited lines – this may change.) Python or JavaScript working with data arriving or leaving through REST client queries or via direct REST access can modify and read the database in a clean, structured way, but currently cannot parse and format CSV. Java, however, can use CsvParser and CsvFormatter, to complement IParser, IFormatter, JsonParser, JsonFormatter, FormattedItemParser and FormattedItemFormatter.

Formatting

The CSV mode tries to parse fields into strings, numbers, and ISO dates. If a field is not parsable as a number or date, it becomes a string. If you want a number or date to come out as a string component instead, precede it with an underscore. To put in any other component type, precede a valid token with an underscore: EntityClasses, Attributes, char[] as Chars(“..”), byte[] as Bytes(10_20), byte strings as ByteString(10_20), Indexes as [10], booleans as true or false, and so on. If you put an underscore in front of a field that is not a valid isolated token, it will stay a string.

CSV mode preserves as much of the original text format as it can, so you can store a CSV file and then retrieve it in a format that is still compatible externally (spaces are deleted around numbers and dates for example). Line delimiters are flexible on input, but become the platform-dependent terminator on output. CSV quotes commas and double quotes within fields by surrounding the field with double quotes, but these double quotes are not considered part of the original data. An original double quote within a field is doubled. Therefore, double double quotes quote double quotes.
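Python’s standard csv module follows the same quoting convention, so it can be used to check the rule: fields containing commas or double quotes are wrapped in double quotes, and an embedded double quote is doubled.

```python
import csv
import io

# Write one row with a plain field, a field containing a comma,
# and a field containing a double quote, then inspect the raw line.
buf = io.StringIO()
csv.writer(buf, lineterminator='\n').writerow(['plain', 'a,b', 'say "hi"'])
line = buf.getvalue().strip()
# line == 'plain,"a,b","say ""hi"""'
```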

Extraction, Transformation, and Loading

For uploading large amounts of data, consider breaking it into multiple CSV documents, such as by prefixing each document with some other component type, like a long, another Index, or a string ‘name’ or file name. This extra breakdown can then be handled in queries by putting a symbol in the place of the extra breakdown component, followed by a symbol for the line number, which is an Index data type. A query can renumber the lines later into one huge CSV document if desired, or just use ‘wildcard symbols’ matching the breakdown and line number components to make them be ignored. The fields within the lines can be extracted, parsed, cleansed, transformed, and transferred anywhere else (or the reverse) with a suitable query, so that they become more understandable, flexible, and extensible, in a semantically correct schema. A single query can do this for any subset of the fields at once, and it can send the data to multiple destinations or retrieve it and assemble one or more CSV documents in elegant ways. The fields on input have some implied significance based on context, including the field position in each line, but the query can store them in ways that make their significance clear and standardized, by adding meta components (EntityClass i.e. class and Attribute components) into the stored Items in some way.
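As a rough illustration of the renumbering idea (the document names and layout below are hypothetical, not an InfinityDB API), a breakdown of named CSV documents can be flattened into one continuously numbered document:

```python
# Several CSV documents, each keyed by a name component and then,
# implicitly, by line number (playing the role of the Index data type).
docs = {
    'part1.csv': ['a,1', 'b,2'],
    'part2.csv': ['c,3'],
}

def renumber(docs):
    """Flatten (name, line_index, line) into one (index, line) document."""
    merged = {}
    n = 0
    for name in sorted(docs):     # deterministic order over the name component
        for line in docs[name]:
            merged[n] = line      # n is the new continuous Index
            n += 1
    return merged

# renumber(docs) == {0: 'a,1', 1: 'b,2', 2: 'c,3'}
```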

MetaData

With some extra custom structure, the file name and other ‘metadata’ of each CSV document can be associated with it. It may be necessary to delete the first line manually, in case it appears to be column headings, but then of course when reproducing the file for output, it will be absent. The column headings can be moved to some metadata associated with the CSV document, or just become implicit in the transformed structure, since often the CSV document will be deleted after use. The query can automatically omit the first line if desired, such as with a simple filter kind of symbol:

query {
    pattern MyCsv =csv_file_name =csv_line =not_first_line =first_column =second_column;
...
    Where {
        'not_first_line' {
            kind 'filter';
            equals ((long)csv_line > 0);
        }
       ...
    }
}

Similar to BLOB,CLOB

CSV format resembles that for Binary Long Objects or Character Long Objects, i.e. BLOBs and CLOBs. These are also stored with increasing Indexes, but after those Indexes come 1024-element byte[] or char[] components, which you can see token-formatted as Bytes(33_15_88) or Chars(“…”). You can store any raw data whatsoever with BLOBs and CLOBs, plus there is a way to associate a mime type with them in the REST access, which allows them to be displayed in the ‘Show Table’ mode or web browser as images, for example. Java can work with these through the BinaryLongObjectOutputStream and so on. These will work on any ItemSpace and on any prefix, where the prefix is effectively the name.

Generate a Timeline Graph

Here we take several time series and merge them into a single time series for display on a line chart. The two time series are sets of (time1, y1) and (time2, y2) tuples, and the chart needs to see (sample_number, time, y1, y2). We use a series kind of symbol to create the sample slots as a set of times increasing linearly in steps of 12:00 hours, and then we get the y1 and y2 at the end (or start) of each sample slot to combine them. Here we have the data and other constants in the query itself, but these can all be brought in with expressions from the database or elsewhere. The WithData section of the query holds some constant data as input for demonstration.

The pattern contains the series symbol, which generates iterations and represents a loop, followed by a counter, which just increments by one for each iteration of the series. Only plain symbols and series kinds of symbol can generate loop iterations.

The =wd symbol is a ‘root map’ of kind ‘with data’ so it is at the root of the pattern tree, and the patterns inner to it match the Items under WithData. This root map could be changed to get data from anywhere, and by default i.e. without the =wd symbol, the data comes from the input database.

The symbols =x1 and =x2 have symbol attributes last true and to =series so they each take from the input data the last time from the series within a sample interval. Alternatively, we could use the first time in each sample interval.

We convert the times to long milliseconds and back so we can do arithmetic on them (dates and times will have math soon).

{
    description 'the withdata would normally be variable';
    query {
        Where {
            'counter' kind 'counter';
            'series' {
                equals ((date)series);
                from ((long)2021-10-8);
                increment ((long)12:00:00);
                kind 'series';
                to ((long)2021-10-12);
            }
            'wd' kind 'with data';
            'x1' {
                last true;
                to =series;
            }
            'x2' {
                last true;
                to =series;
            }
        }
        WithData {
            Graph1 {
                2021-10-8T-07:00 38;
                2021-10-10T-07:00 42;
                2021-10-12T-07:00 8;
            }
            Graph2 {
                2021-10-9T-07:00 77;
                2021-10-11T-07:00 98;
                2021-10-13T-07:00 102;
            }
        }
        pattern {
            =series =counter;
            =wd {
                Graph1 =x1 =y1;
                Graph2 =x2 =y2;
            }
        }
        result =counter =y1 =y2 =series;
    }
}

The result shows increasing counts at the left, and then the value of y1, y2, and the sample time, which increases by 12 hours for each sample. This could be sent via a client query to a web page (using AJAX for example), where the JavaScript uses a graphing library to draw a line chart. Or, the InfinityDB Python library can read it.

0 38 77 2021-10-9T-07:00
1 38 77 2021-10-9T12-07:00
2 42 77 2021-10-10T-07:00
3 42 77 2021-10-10T12-07:00
4 42 98 2021-10-11T-07:00
5 42 98 2021-10-11T12-07:00
6 8 98 2021-10-12T-07:00
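The sampling logic can be mirrored in plain Python to check the table above: at each 12-hour sample time, take the most recent point at or before that time from each series, skipping samples where either series has no point yet. This is an analogy to the query, not InfinityDB code:

```python
from datetime import datetime, timedelta

# The two input series, as in the WithData section above.
g1 = {datetime(2021, 10, 8): 38, datetime(2021, 10, 10): 42, datetime(2021, 10, 12): 8}
g2 = {datetime(2021, 10, 9): 77, datetime(2021, 10, 11): 98, datetime(2021, 10, 13): 102}

def last_at_or_before(series, t):
    """The 'last true' + 'to =series' behavior: latest point not after t."""
    times = [x for x in series if x <= t]
    return series[max(times)] if times else None

def merge(g1, g2, start, stop, step):
    rows, counter, t = [], 0, start
    while t <= stop:
        y1, y2 = last_at_or_before(g1, t), last_at_or_before(g2, t)
        if y1 is not None and y2 is not None:   # samples before either series begins are skipped
            rows.append((counter, y1, y2, t))
            counter += 1
        t += step
    return rows

rows = merge(g1, g2, datetime(2021, 10, 8), datetime(2021, 10, 12), timedelta(hours=12))
# rows[0] == (0, 38, 77, datetime(2021, 10, 9, 0, 0)), matching the first line above
```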

Client Query to Find a Next Thing

Access a remote InfinityDB through the REST interface, and ask for the next thing in a sequence of things that are in the database. This is the definition of a simple API that can be accessed via the URL for that query. The input here is from the ‘request content’, which is identified in the pattern using the symbol =in. The output is to the ‘response content’, which is identified in the result with the symbol =out. The work is done in the from =current_thing; and the increment 1;. The increment can be negative as well. Only one thing is returned at a time, limited by the length 1;, although it could be larger. Much more elaborate APIs can easily be defined, with more complex parameters and results, and larger and more complex kinds of things.

query {
    Where {
        'in' kind 'request content';
        'next_thing' {
            from =current_thing;
            increment 1;
            length 1;
        }
        'out' kind 'response content';
    }
    pattern {
        Things =next_thing;
        =in current_thing =current_thing;
    }
    result {
        =out next_thing =next_thing;
    }
}

The database has this state, and this is where the ‘things’ are stored. We are asking for the thing after thing 2, and it returns ‘thing 3’.

Things "thing 1"
Things "thing 2"
Things "thing 3"

The output will go over the wire as JSON, and be decoded by the client. The Python client uses the infinitydb.access package, which returns a dict, and a JavaScript client uses AJAX and gets an Object. Notice the underscore before “_next_thing”. That indicates to the other end that this is an InfinityDB Attribute, not a literal string, which would not have the underscore. By putting in these underscores, i.e. ‘underscore quoting’, we can encode any InfinityDB data type into a JSON key, which has to have the double quotes. For example, sending a long as a key would look like “_5”. To send a literal string that already starts with an underscore, the code transparently adds another underscore and later removes it. Here is the JSON actually returned to the client:

{
    "_next_thing" : "thing 3"
}
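A simplified sketch of the underscore-quoting convention (illustrative only – the real encoders are in the InfinityDB client libraries, and this crude version treats any non-string component uniformly, distinguishing only longs on decode):

```python
def encode_key(component):
    """Encode a component as a JSON key using underscore quoting."""
    if isinstance(component, str):
        # A literal string that itself starts with '_' gets one more underscore...
        return '_' + component if component.startswith('_') else component
    # ...while non-string components (Attributes, longs, ...) get an '_' prefix.
    return '_' + str(component)

def decode_key(key):
    """Reverse the quoting; crude: digits become a long, else an attribute name."""
    if key.startswith('__'):
        return key[1:]            # a literal string that began with '_'
    if key.startswith('_'):
        body = key[1:]
        return int(body) if body.isdigit() else body
    return key                    # a plain literal string

# encode_key(5) == '_5' and decode_key('_5') == 5
```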

If the things were to be more complex, this would return only their first primitive components, which are probably something like their ‘ids’. The reason is that by default, a symbol matches a single primitive value (remember a primitive is any component except EntityClass or Attribute). However, it is possible to return the entire thing by setting 'next_thing' type 'item', which means each match of next_thing includes all of the rest of each Item. Also, 'next_thing' type 'number' could be used to select only the numeric ids, and so on. The possible types include any of the 12 basic types, or ‘tuple’ or ‘item’. The basic types can be a set, not just a single possibility, so 'next_thing' type { 'double'; 'float'; } could be used. Also, ‘number’, ‘component’, ‘primitive’ and ‘meta’ are shorthands.

Fibonacci Reducer

Here is a reducer that creates the first Fibonacci numbers. We use a ‘series’ kind of symbol at the left or ‘outside loop’ of the pattern to create integers, but we don’t use those numbers in the reducer – they are just for providing a series of iterations to drive the reducer. A PatternQuery is analogous to a set of nested loops, where each plain symbol represents a loop surrounding the ‘inner’ or right part of the pattern, and each series symbol also represents a loop, generating increasing integers but without accessing the database. Only the plain symbol and series symbol represent loops.

query {
    pattern =n =fib;
    result  =n ('fib(%s)=%s'.format(n,(head)fib));
    Where  {
        'n' {
            kind 'series';
            to 10;
        }
        'fib' {
            from ( (0,1,1) );
            kind 'reducer';
            equals ( ((tail)fib, fib.get(1) + fib.get(2)) );
        }
    }
}

The state of the reducer is a three-element item, which is a tuple containing fib(n), fib(n+1), and fib(n+2). The ‘from’ attribute initializes this to the tuple (0,1,1). The from attribute is an expression because it is in parentheses, and the tuple construction requires a further pair of parentheses. Tuples are indicated by having commas. In the equals attribute of the ‘fib’ symbol, the definition of the next value involves a self-reference to the fib symbol, and the tuple for the next iteration is constructed. This produces:

0 "fib(0)=0"
1 "fib(1)=1"
2 "fib(2)=1"
3 "fib(3)=2"
4 "fib(4)=3"
5 "fib(5)=5"
6 "fib(6)=8"
7 "fib(7)=13"
8 "fib(8)=21"
9 "fib(9)=34"
10 "fib(10)=55"
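The rolling three-element state can be mirrored in plain Python: the state holds (fib(n), fib(n+1), fib(n+2)), and each step drops the head and appends the sum of the last two elements, just as ((tail)fib, fib.get(1) + fib.get(2)) does:

```python
def fib_series(to):
    state = (0, 1, 1)                              # the 'from' initializer
    out = []
    for n in range(to + 1):                        # the series symbol, 0..to inclusive
        out.append('fib(%s)=%s' % (n, state[0]))   # (head)fib
        state = state[1:] + (state[1] + state[2],) # ((tail)fib, fib.get(1) + fib.get(2))
    return out

# fib_series(10)[-1] == 'fib(10)=55', matching the last output row above
```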

Running Weighted Average Reducer

Here is a more complex reducer that prints the weighted average of the last n values of a series of generated random numbers. It shows some of the real power of reducers. The last n values are kept in an item, which ‘rolls forward’ by means of an item constructor ( ((tail)last_n, random) ). The weighting is done by multiplying two items, last_n and weights_normalized, which works element-by-element. Then the predefined (sum) cast totals that. The (weights_normalized * 0) just creates the initial Item of zeroes, because a component times an item distributes the operation over the item elements. The ‘weights’ symbol constructs an item from a comma-separated list or ‘item constructor’, which can contain any expressions and is automatically flattened. The weights, input, and output could be elsewhere, such as in the database.

query {
    pattern =n =random =last_n;
    result  =n ('n=%d random=%2d avg=%.3f'.format(n, random, avg));
    Where  {
        'n' {
            kind 'series';
            to 10;
            comment 'generate series of numbers';
        }
        'random' {
            kind 'random';
            to_exclusive 100;
            depends_on 'n';
            comment 'generate arbitrary input. random is long';
        }
        'last_n' {
            from (weights_normalized * 0);
            kind 'reducer';
            equals ( ((tail)last_n, random) );
            comment 'the last n values as a tuple';
        }
        'weights' equals ((1.0, 2.0, 3.0, 5.0, 6.0));
        'weights_normalized' equals (weights/(sum)weights);
        'avg' equals ((sum)(last_n * weights_normalized));
    }
}

The output of one run was:

0 "n=0 random=47 avg=0.000"
1 "n=1 random= 8 avg=2.824"
2 "n=2 random=64 avg=24.941"
3 "n=3 random=17 avg=26.235"
4 "n=4 random=75 avg=43.706"
5 "n=5 random=51 avg=51.059"
6 "n=6 random=52 avg=52.353"
7 "n=7 random=44 avg=49.647"
8 "n=8 random=58 avg=53.000"
9 "n=9 random=63 avg=56.176"
10 "n=10 random= 1 avg=37.353"
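The reducer arithmetic can be mirrored in plain Python using the random values printed above. Note that at n=0 the window still holds the all-zero ‘from’ value, which is why the first average is 0.000; this sketch reproduces the printed run (it is an analogy, not InfinityDB code):

```python
weights = (1.0, 2.0, 3.0, 5.0, 6.0)
wn = tuple(w / sum(weights) for w in weights)      # weights_normalized
randoms = [47, 8, 64, 17, 75, 51, 52, 44, 58, 63, 1]  # from the run above

def running_weighted_avg(randoms, wn):
    window = (0.0,) * len(wn)                      # from (weights_normalized * 0)
    avgs = []
    for n, r in enumerate(randoms):
        if n > 0:
            window = window[1:] + (r,)             # ((tail)last_n, random)
        avgs.append(sum(v * w for v, w in zip(window, wn)))  # (sum)(last_n * weights_normalized)
    return avgs

avgs = running_weighted_avg(randoms, wn)
# round(avgs[1], 3) == 2.824 and round(avgs[10], 3) == 37.353, as printed above
```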

Filtering

Filter out some loop iterations – here the ones with counts not divisible by 3. The equals symbol attribute of the filter kind of symbol evaluates to a boolean that determines whether the inner parts of the pattern are visited. The series kind of symbol just iterates from 0 to 10. The parentheses on the equals symbol attribute indicate that it is an expression. The iterations of the =series symbol are not prevented, but are ‘ignored’, so the execution time is the same as without the filter, except that the time writing to the output is reduced. The idea of throwing away or ignoring iterations can be called ‘filtering’, whereas the plain symbol attributes from, from_exclusive, to, to_exclusive, first, last, equals, and length do reduce the total iterations, and are very efficient, doing only direct accesses (B-Tree accesses at log(n) speed).

Extra iterations that are thrown away are avoided whenever possible. However, the part of the pattern inner to a plain symbol must match in order for the pattern to match overall, so if the inner part fails to match for some reason, then the plain symbol must go to the next iteration, and a filtering occurs. In this case the execution ‘backtracks’. For example, a literal following a plain symbol will match only if the corresponding database data equals the literal’s value, so a literal is a kind of predicate or filter. Similarly, an expression in the pattern will match only if the corresponding database data equals the evaluated expression. The compiler tries to re-organize the patterns internally to move literals and expressions outward. Note that when a literal, expression, or filter does match, only one database access is required to continue the matching inwards, and the inner parts of the pattern are similarly efficient.

query {
    Where {
        'filter' {
            equals (series % 3 == 0);
            kind 'filter';
        }
        'series' {
            kind 'series';
            to 10;
        }
    }
    pattern =series =filter;
    result =series;
}

The output is below, formatted as i code:

{
    0;
    3;
    6;
    9;
}
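In plain Python terms (an analogy, not InfinityDB code), the filter corresponds to a predicate inside the loop rather than a restriction of the range: every series iteration still runs, but only those passing the predicate reach the result.

```python
# series to 10, filter series % 3 == 0
result = [n for n in range(11) if n % 3 == 0]
# result == [0, 3, 6, 9]
```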

Generating Lists

Sometimes the output should be in the form of a list, such as when data is being generated for a client query, and the result will be read by Python or JavaScript. This requires only casting an increasing long to an index type. Index is a basic datatype, which is really just a special kind of long. (When encoded, it is really just a long with a special one-char code preceding it.) Here we generate a sequence of longs in a series kind of symbol, but in the result we have an expression that casts the count to an index. Lists can also be specified in i code in a literal form, like [‘a’, ‘b’].

query {
    Where 'indexes' {
        kind 'series';
        to 10;
    }
    pattern =indexes;
    result ((index)indexes) 'hello' =indexes;
}

The output, in i code form, is:

[
    'hello' 0,
    'hello' 1,
    'hello' 2,
    'hello' 3,
    'hello' 4,
    'hello' 5,
    'hello' 6,
    'hello' 7,
    'hello' 8,
    'hello' 9,
    'hello' 10
]

In tokenized form, you can see the index components there, which look like [n].

[0] "hello" 0
[1] "hello" 1
[2] "hello" 2
[3] "hello" 3
[4] "hello" 4
[5] "hello" 5
[6] "hello" 6
[7] "hello" 7
[8] "hello" 8
[9] "hello" 9
[10] "hello" 10

In JSON form, it is a list, but it is rather verbose. Nevertheless, this is how it goes over the wire in a REST access. Note that if the numbers had been doubles rather than longs, they would not look like “_0” but like 0.0. This preserves the InfinityDB data typing as the data passes into and out of JSON form.

[
    {
        "hello" : "_0"
    },
    {
        "hello" : "_1"
    },
    {
        "hello" : "_2"
    },
    {
        "hello" : "_3"
    },
    {
        "hello" : "_4"
    },
    {
        "hello" : "_5"
    },
    {
        "hello" : "_6"
    },
    {
        "hello" : "_7"
    },
    {
        "hello" : "_8"
    },
    {
        "hello" : "_9"
    },
    {
        "hello" : "_10"
    }
]

Expression Symbols

It is possible to have a symbol that contains nothing but an expression, creating a ‘named expression’ or expression symbol. This symbol can be referred to in other expressions. It is useful for factoring out common code.

query {
    Where 'expr' equals ((string)symbol2 + 3);
    pattern =symbol1 =symbol2 =expr;
    result =symbol1 =symbol2 =expr;
}

Catching Undefined

If the undefined value is returned by an expression or subexpression, normally the execution will halt immediately with an ‘exception’ i.e. an error condition with an error message. However, we can prevent the halt by ‘catching’ the exception and doing something about it, such as by substituting some value for the expression or subexpression to actually return. An expression or subexpression is by default in ‘strict’ mode and will always throw the exception and cause a halt. In order to prevent that, you can put a symbol in ‘unstrict’ mode by setting the symbol attribute strict false. This will cause all references to the symbol to be unstrict and therefore to not halt. Also, a cast to (unstrict) on an expression or subexpression will prevent the halt. Undefined cannot be put into an item or tuple at any time, either temporarily during expression evaluation or in the database, and unstrict mode will just allow the entire erroneous item or tuple to turn into undefined during execution. There is also a cast to (strict).
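A rough analogy, offered as an assumption for readers who know Python rather than as InfinityDB semantics: strict mode is like letting an exception propagate and halt the program, while unstrict mode with the instanceof undefined test is like catching the exception and substituting a value.

```python
def safe_divide(numerator, denominator):
    """Analog of the unstrict + instanceof undefined pattern below."""
    try:
        return numerator / denominator   # the 'dangerous' expression
    except ZeroDivisionError:
        return -1                        # the substituted value
```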

Here we catch undefined with an instanceof undefined and return -1 instead. There are two expressions in this code – dangerous_expression1 and dangerous_expression2 – that can cause the undefined from a division by zero. The first one defaults to strict everywhere it is referenced, but the second sets the strict symbol attribute false, so where it is referenced it will always be unstrict. In safe_expression1 we have to cast the reference to dangerous_expression1 to (unstrict) to avoid the exception, but in safe_expression2, we do not need the cast.

query {
    Where {
        'dangerous_expression1' {
            comment 'this defaults to strict wherever referenced';
            equals (1 / series);
        }
        'dangerous_expression2' {
            comment 'defaults to unstrict wherever referenced';
            equals ((1 / series));
            strict false;
        }
        'safe_expression1' {
            comment 'allow dangerous_expression1 to be undefined';
            equals (
                 ((unstrict)dangerous_expression1) instanceof
                  undefined ? -1 : dangerous_expression1);
        }
        'safe_expression2' {
            comment 'dangerous_expression2 is already unstrict';
            equals (
                 dangerous_expression2 instanceof
                  undefined ? -1 : dangerous_expression2);
        }
        'series' {
            kind 'series';
            to 3;
        }
    }
    pattern =series;
    result =series 'first' ((safe_expression1)) 'second' ((safe_expression2));
}

We had to use two-deep parentheses around ((safe_expression1)) to force it to be an expression rather than come out as a symbol. A simple (safe_expression1) looks exactly like =safe_expression1 to the compiler, and that would prevent the error we are trying to show. The result is below. The first row shows the division by zero and the substitution to -1.

{
    0 'first' -1 'second' -1;
    1 'first' 1 'second' 1;
    2 'first' 0 'second' 0;
    3 'first' 0 'second' 0;
}