AndSpaces
and OrSpaces, then show how
simple it can be to use them in queries with complex 'where' constraints.
AndSpaces create a logical or 'virtual' view of one or more
'input' ItemSpaces
as an intersection. OrSpaces
create a union view of the input ItemSpaces. It is also possible
to make the Items in one or more AndSpace inputs
appear to be absent from the other inputs - a 'negative'
input, creating a set difference, but there must be at
least one positive input. AndSpace and OrSpace
ItemSpaces are read-only.
It is still possible for the underlying input
spaces to be modified themselves, and the changes
will be immediately visible in the containing AndSpace or
OrSpace. The superclass of both of these is AndOrSpace.
Such 'virtual' ItemSpaces present a view that can be used in any situation where an ItemSpace is needed. Virtual ItemSpaces are usually lightweight and always thread-safe. They can be used in a wide range of situations, such as for efficient queries or for dynamic views that can be presented to a user with no query delay, while keeping the convenient, familiar, random/sequential access capabilities of the simple ItemSpace model.
The computation of these ItemSpaces is done
dynamically, as the retrieval operations are
executed. The results often take only the time
necessary to examine each input once, but
AndSpaces have an internal loop that takes
time dependent on the data, although it is still
very efficient. No temporary
space is needed, and the only 'startup' cost
is the time and space needed for the AndSpace
and OrSpace constructions themselves.
There is no Cu inside these spaces or any other
heavyweight Objects to dispose(), although other
virtual ItemSpaces may be heavyweight.
AndSpace algorithm is a 'zigzag', in which
the Cu supplied to the next() or
other retrieveal method is scanned over the
input ItemSpaces until a match between all of
them is found. The search can be extremely fast because
when one of the inputs is examined, it will
return a 'nearest' Item, and that Item can
be placed in the Cu for the next loop. This
means many Items may be skipped in the other
inputs.
The inputs are scanned with the first inputs being the 'highest priority', so the first inputs should also be the smallest in terms of Item count, if this is known in advance, although this is not in general necessary. The inputs are looked at 'round-robin', so each input has a chance to cause a long-distance skip. The round-robin scan starts from the initial input on each invocation of a retrieval method.
In any case, the size of the smallest input set is the worst case in determining the number of total loop iterations that will be required in scanning the total virtual ItemSpace. The time for any particular retrieval depends on the data, however. For example, the virtual ItemSpace may be empty because there is no intersection between the input ItemSpaces, and the Items in the inputs mostly 'interleave': in that case, the first and only retrieval will be proportional to the time it takes to scan the smallest input. Often, however, the interleaving is ragged and the zig-zag algorithm will have many opportunities to skip many input Items, resulting in a much smaller loop time. There is no redundant access of individual Items, pre-computation or any kind of temporary space needed. When there are negative inputs, they do not contribute to the zig-zag speedup effect, but are tested after a potential match is found based on the positive inputs.
AndSpace example, suppose
we put some Items in an ItemSpace and intersect them. We want
all the students taught by both of two teachers. The SQL is:
SELECT A.TS_STUDENT FROM TEACHER_STUDENT A TEACHER_STUDENT B WHERE A.TS_STUDENT=B.TS_STUDENT AND A.TS_TEACHER = ? AND B.TS_TEACHER = ?;Not including the JDBC code. What is the query plan and performance? It depends on the set of indexes available and on the statistics. At the bottom of the code below is the actual InfinityDB query. A VolatileItemSpace holds Items in memory (not on disk) and is wrapped by two ItemSubspaces using prefixes that allow viewing the sets of Values for the TEACHES Attribute of the two TEACHER Entities. These are used both in the query and for initializing the database. The query consists in combining the two dbs in an AndSpace to intersect the two sets of Values.
package com.infinitydb.manualexamples;
import java.io.IOException;
import com.infinitydb.AndSpace;
import com.infinitydb.Attribute;
import com.infinitydb.Cu;
import com.infinitydb.EntityClass;
import com.infinitydb.ItemSpace;
import com.infinitydb.ItemSubspace;
import com.infinitydb.VolatileItemSpace;
public class AndOrSpaceManualExample {
static final EntityClass TEACHER = new EntityClass(0);
static final Attribute TEACHES = new Attribute(0);
public static void main(String[] args) throws IOException {
ItemSpace db = new VolatileItemSpace();
// db is the InfinityDB, the container ItemSpace for all the data
// Make an ItemSubspace over the students taught by teacher1
// The TEACHES Attribute is multi-valued
Cu cu = Cu.alloc();
String teacher1 = "jones";
cu.clear().append(TEACHER).append(teacher1).append(TEACHES);
ItemSpace db1 = new ItemSubspace(db, cu);
// Make an ItemSubspace over the students taught by teacher2
String teacher2 = "jackson";
cu.clear().append(TEACHER).append(teacher2).append(TEACHES);
ItemSpace db2 = new ItemSubspace(db, cu);
// populate the two db's, i.e. add students to each teacher
db1.insert("ari"); // db1 is smaller. We know this, so we put it first
db1.insert("miller");
db1.insert("schmidt");
db2.insert("ada"); // Strings, wrappers, and CuAppendables can be inserted directly
db2.insert("miller");
db2.insert("ono");
db2.insert("patron");
db2.insert("smith");
db2.insert("zeno");
db2.insert("schmidt");
db2.insert("zero");
// The Set-based InfinityDB Query
AndSpace andSpace = new AndSpace(); // intersect the two student sets
andSpace.add(db1);
andSpace.add(db2); // any number of inputs can be used
//andSpace.add(db3, false); // 'negative' inputs create differences
while (andSpace.next(cu)) { // enumerate the intersection
System.out.println("intersection contains:" + cu);
}
// prints
// intersection contains:"miller"
// intersection contains:"schmidt"
}
}
The algorithm starts with db1, getting "ari" first,
then, when checking whether "ari" is in db2, we get "miller".
So, we look back at db1, finding that "miller" is
indeed there too, so we have a match and we return it. We have skipped
"ada" in db2.
On the next next() retrieval, we start with "miller" in db1,
finding "schmidt", which we then check in db2. Db2 contains "smith", which
is still not a match, we go back to db1, where we find nothing after
"schmidt", and we are done. We never looked at 5 of the Items. Depending
on the data, the degree of improvement due to the
zig-zag algorithm may be very great. Each
access of db2 might require a block I/O, for example.
next() or
another of the retrieval operations, it has to do a
next() on each input to determine
the nearest Item, which it leaves in the provided cu.
An optimal algorithm for purely sequential scanning, such as in a sort/merge, needs to keep a set of Items, one per input, and continually re-sort them amongst themselves to determine the next input to merge. The purpose of virtual ItemSpaces, on the other hand, is to be efficient, light-weight multi-threaded random-access ItemSpace views, thus they cannot have state such as Cu's, which are for single threads.
OrSpaces can do many things. Another example might be to view a combination of multiple distinct databases which might reside in different files. Client code would not be aware of the situation other than the fact that the db is read-only (this may change.) If desired, the OrSpace can be scanned from start to end and written into a new db in order to permanently combine the inputs.
There are many sort algorithms (!). InfinityDB uses the
basic 'sort/merge' approach, which is very common. For example,
(The com.infinitydb.examples.FileIndexer
example application shows one efficient way to do
sort/merge inside an ItemSpace, but it is not
at the introductory level.) Also, the bulk
loaders DiskBasedMergeSorterItemOutput and
ItemSpaceBasedMergeSorterItemOutput can do efficient
sorting of large numbers of Items for stateful
combination of sets of Items that can then be
inserted or deleted from an ItemSpace. Note also
IncrementalMergingItemSpace, which can do stateful
merging of Items invisibly, allowing both loading and
retrieval all at once (thus apparently violating the rule
stated above: unfortunately it only allows single thread
dynamic access).
AndSpaces and OrSpaces
as inputs to other AndSpaces and OrSpaces.
Nesting And's in And's or Or's in Or's works
exactly as described above. In general,
same-type nestings of ands-in-ands and ors-in-ors work identically . In order to
simplify a complex nesting, using
AndOrSpace.optimize() is
recommended: it will
flatten the trees by merging same-type
nestings, and apply deMorgan's laws in
many cases. It is not a general logic
minimization algorithm and the speedup is
hard to predict, but the optimization is fast.
AndSpaces and OrSpaces are
stateless, so the optimizer can treat them
as if they were immutable, and the result of
optimize may be a new
AndSpace or OrSpace.
RangeItemSpace. To do this,
wrap a RangeItemSpace around
any input ItemSpace, giving it the
limits as parameters. The baseSpace
can be an AndSpace or OrSpace or another
RangeItemSpace or an InfinityDB and so on.
public RangeItemSpace(ItemSpace baseSpace,Cu cuStart,Cu cuEnd,
boolean isModifiable,
boolean isModifiableOutsideRange,
boolean throwExceptionOnModificationOutsideRange) {
Either cuStart or cuEnd may
be null to avoid placing a limit. Below
is an example of a usage of RangeItemSpace.
The baseDB is a set of Items each
containing only one string component. Any
Item structure with multiple components
of various types can be used as start and
end in any ItemSpace. The resulting
RangeItemSpace riscan in turn be used anywhere,
as a first-class citizen.
Cu cuStart = cu.alloc("M"); // first value allowed: "M"
Cu cuEnd = cu.alloc("O"); // after last value allowed: "0"
RangeItemSpace ris = new RangeItemSpace(baseDB, cuStart, cuEnd, true, false, true);
while (ris.next(cu)) { // enumerate the intersection
System.out.println("range contains:" + cu);
}
cuStart.dispose(); cuEnd.dispose(); // optional for speed.
The above allows "M" but disallows "O" in the
ItemSpace ris. The Items
within the range will be visible to operations
like ris.next(Cu cu, int pl) and so on. It also allows
modifications within the visible range, but not outside,
where it throws an IOException.
next() on the OrSpace will invoke
next() on each input, throwing
away all of the results but the nearest. By
throwing away results, the OrSpace can cause
a nested AndSpace to have to repeat slow loops
as described above. Actually, the problem
is more general than this, because it occurs
whenever there is an OrSpace containing
ItemSpaces that are slow to recompute.
One solution to this problem is to scan each AndSpace completely, storing the Items retrieved in one temporary subspace for each AndSpace, and then to use an OrSpace over the temporary spaces. The AndSpaces are computed very efficiently with the zig-zag algorithm, and the temporary space may be very small.
Another solution to this problem is Laziness.