AndSpaces and OrSpaces

Previous Next

The Logic of AndSpaces and OrSpaces

First we will discuss the logic of AndSpaces and OrSpaces, then show how simple it can be to use them in queries with complex 'where' constraints.

AndSpaces create a logical or 'virtual' view of one or more 'input' ItemSpaces as an intersection. OrSpaces create a union view of the input ItemSpaces. It is also possible to make the Items in one or more AndSpace inputs appear to be absent from the other inputs - a 'negative' input, creating a set difference, but there must be at least one positive input. AndSpace and OrSpace ItemSpaces are read-only. It is still possible for the underlying input spaces to be modified themselves, and the changes will be immediately visible in the containing AndSpace or OrSpace. The superclass of both of these is AndOrSpace.

Such 'virtual' ItemSpaces present a view that can be used in any situation where an ItemSpace is needed. Virtual ItemSpaces are usually lightweight and always thread-safe. They can be used in a wide range of situations, such as for efficient queries or for dynamic views that can be presented to a user with no query delay, while keeping the convenient, familiar, random/sequential access capabilities of the simple ItemSpace model.

The computation of these ItemSpaces is done dynamically, as the retrieval operations are executed. The results often take only the time necessary to examine each input once, but AndSpaces have an internal loop that takes time dependent on the data, although it is still very efficient. No temporary space is needed, and the only 'startup' cost is the time and space needed for the AndSpace and OrSpace constructions themselves. There is no Cu inside these spaces or any other heavyweight Objects to dispose(), although other virtual ItemSpaces may be heavyweight.

AndSpace Algorithm

The AndSpace algorithm is a 'zigzag', in which the Cu supplied to the next() or other retrieveal method is scanned over the input ItemSpaces until a match between all of them is found. The search can be extremely fast because when one of the inputs is examined, it will return a 'nearest' Item, and that Item can be placed in the Cu for the next loop. This means many Items may be skipped in the other inputs.

The inputs are scanned with the first inputs being the 'highest priority', so the first inputs should also be the smallest in terms of Item count, if this is known in advance, although this is not in general necessary. The inputs are looked at 'round-robin', so each input has a chance to cause a long-distance skip. The round-robin scan starts from the initial input on each invocation of a retrieval method.

In any case, the size of the smallest input set is the worst case in determining the number of total loop iterations that will be required in scanning the total virtual ItemSpace. The time for any particular retrieval depends on the data, however. For example, the virtual ItemSpace may be empty because there is no intersection between the input ItemSpaces, and the Items in the inputs mostly 'interleave': in that case, the first and only retrieval will be proportional to the time it takes to scan the smallest input. Often, however, the interleaving is ragged and the zig-zag algorithm will have many opportunities to skip many input Items, resulting in a much smaller loop time. There is no redundant access of individual Items, pre-computation or any kind of temporary space needed. When there are negative inputs, they do not contribute to the zig-zag speedup effect, but are tested after a potential match is found based on the positive inputs.

AndSpace Example

The purpose of virtually intersecting ItemSpaces is better shown in an example query. For a simple AndSpace example, suppose we put some Items in an ItemSpace and intersect them. We want all the students taught by both of two teachers. The SQL is:
	SELECT A.TS_STUDENT FROM TEACHER_STUDENT A TEACHER_STUDENT B
	WHERE A.TS_STUDENT=B.TS_STUDENT
	AND A.TS_TEACHER = ? 
	AND B.TS_TEACHER = ?;
Not including the JDBC code. What is the query plan and performance? It depends on the set of indexes available and on the statistics. At the bottom of the code below is the actual InfinityDB query. A VolatileItemSpace holds Items in memory (not on disk) and is wrapped by two ItemSubspaces using prefixes that allow viewing the sets of Values for the TEACHES Attribute of the two TEACHER Entities. These are used both in the query and for initializing the database. The query consists in combining the two dbs in an AndSpace to intersect the two sets of Values.
package com.infinitydb.manualexamples;

import java.io.IOException;

import com.infinitydb.AndSpace;
import com.infinitydb.Attribute;
import com.infinitydb.Cu;
import com.infinitydb.EntityClass;
import com.infinitydb.ItemSpace;
import com.infinitydb.ItemSubspace;
import com.infinitydb.VolatileItemSpace;

public class AndOrSpaceManualExample {
	static final EntityClass TEACHER = new EntityClass(0);
	static final Attribute TEACHES = new Attribute(0);
	
	public static void main(String[] args) throws IOException {
		ItemSpace db = new VolatileItemSpace();
		// db is the InfinityDB, the container ItemSpace for all the data

		// Make an ItemSubspace over the students taught by teacher1
		// The TEACHES Attribute is multi-valued
		Cu cu = Cu.alloc();
		String teacher1 = "jones";
		cu.clear().append(TEACHER).append(teacher1).append(TEACHES);
		ItemSpace db1 = new ItemSubspace(db, cu);
		
		// Make an ItemSubspace over the students taught by teacher2
		String teacher2 = "jackson";
		cu.clear().append(TEACHER).append(teacher2).append(TEACHES);
		ItemSpace db2 = new ItemSubspace(db, cu);

		// populate the two db's, i.e. add students to each teacher
		db1.insert("ari");  // db1 is smaller. We know this, so we put it first
		db1.insert("miller");
		db1.insert("schmidt");

		db2.insert("ada"); // Strings, wrappers, and CuAppendables can be inserted directly
		db2.insert("miller");
		db2.insert("ono");
		db2.insert("patron");
		db2.insert("smith");
		db2.insert("zeno");
		db2.insert("schmidt");
		db2.insert("zero");

		// The Set-based InfinityDB Query	
		AndSpace andSpace = new AndSpace(); // intersect the two student sets
		andSpace.add(db1);
		andSpace.add(db2); // any number of inputs can be used
		//andSpace.add(db3, false); // 'negative' inputs create differences
		while (andSpace.next(cu)) { // enumerate the intersection
			System.out.println("intersection contains:" + cu);
		}
		// prints
		// intersection contains:"miller"
		// intersection contains:"schmidt"
	}
}
The algorithm starts with db1, getting "ari" first, then, when checking whether "ari" is in db2, we get "miller". So, we look back at db1, finding that "miller" is indeed there too, so we have a match and we return it. We have skipped "ada" in db2. On the next next() retrieval, we start with "miller" in db1, finding "schmidt", which we then check in db2. Db2 contains "smith", which is still not a match, we go back to db1, where we find nothing after "schmidt", and we are done. We never looked at 5 of the Items. Depending on the data, the degree of improvement due to the zig-zag algorithm may be very great. Each access of db2 might require a block I/O, for example.

OrSpace Algorithm

The OrSpace algorithm is very simple, but not optimal, compared to a merge as used in typical sort/merge applications or compared to AndSpace. It is nevertheless quite sufficient for many uses and is very easy to set up; the usage is identical to AndSpace, but there can be no negative inputs. When the OrSpace sees a next() or another of the retrieval operations, it has to do a next() on each input to determine the nearest Item, which it leaves in the provided cu.

An optimal algorithm for purely sequential scanning, such as in a sort/merge, needs to keep a set of Items, one per input, and continually re-sort them amongst themselves to determine the next input to merge. The purpose of virtual ItemSpaces, on the other hand, is to be efficient, light-weight multi-threaded random-access ItemSpace views, thus they cannot have state such as Cu's, which are for single threads.

OrSpaces can do many things. Another example might be to view a combination of multiple distinct databases which might reside in different files. Client code would not be aware of the situation other than the fact that the db is read-only (this may change.) If desired, the OrSpace can be scanned from start to end and written into a new db in order to permanently combine the inputs.

Stateful OR-ing

Optimal algorithms for high performance for 'OR'ing a collection of input ItemSpaces or any other input streams of unsorted Items involves keeping state - normally at least a single Item per input. Thus these algorithms are not normally useful for simultaneous multi-threaded access. See

There are many sort algorithms (!). InfinityDB uses the basic 'sort/merge' approach, which is very common. For example, (The com.infinitydb.examples.FileIndexer example application shows one efficient way to do sort/merge inside an ItemSpace, but it is not at the introductory level.) Also, the bulk loaders DiskBasedMergeSorterItemOutput and ItemSpaceBasedMergeSorterItemOutput can do efficient sorting of large numbers of Items for stateful combination of sets of Items that can then be inserted or deleted from an ItemSpace. Note also IncrementalMergingItemSpace, which can do stateful merging of Items invisibly, allowing both loading and retrieval all at once (thus apparently violating the rule stated above: unfortunately it only allows single thread dynamic access).

Nested AndSpaces and OrSpaces

It is reasonable to use AndSpaces and OrSpaces as inputs to other AndSpaces and OrSpaces. Nesting And's in And's or Or's in Or's works exactly as described above. In general, same-type nestings of ands-in-ands and ors-in-ors work identically . In order to simplify a complex nesting, using AndOrSpace.optimize() is recommended: it will flatten the trees by merging same-type nestings, and apply deMorgan's laws in many cases. It is not a general logic minimization algorithm and the speedup is hard to predict, but the optimization is fast. AndSpaces and OrSpaces are stateless, so the optimizer can treat them as if they were immutable, and the result of optimize may be a new AndSpace or OrSpace.

Range ItemSpaces

A query involving range criteria can use a RangeItemSpace. To do this, wrap a RangeItemSpace around any input ItemSpace, giving it the limits as parameters. The baseSpace can be an AndSpace or OrSpace or another RangeItemSpace or an InfinityDB and so on.
    public RangeItemSpace(ItemSpace baseSpace,Cu cuStart,Cu cuEnd,
                    boolean isModifiable,
                    boolean isModifiableOutsideRange,
                    boolean throwExceptionOnModificationOutsideRange) {
Either cuStart or cuEnd may be null to avoid placing a limit. Below is an example of a usage of RangeItemSpace. The baseDB is a set of Items each containing only one string component. Any Item structure with multiple components of various types can be used as start and end in any ItemSpace. The resulting RangeItemSpace riscan in turn be used anywhere, as a first-class citizen.
	Cu cuStart = cu.alloc("M"); // first value allowed: "M" 
	Cu cuEnd = cu.alloc("O"); // after last value allowed: "0"
	RangeItemSpace ris = new RangeItemSpace(baseDB, cuStart, cuEnd, true, false, true);
	while (ris.next(cu)) { // enumerate the intersection
		System.out.println("range contains:" + cu);
	}
	cuStart.dispose(); cuEnd.dispose(); // optional for speed.
The above allows "M" but disallows "O" in the ItemSpace ris. The Items within the range will be visible to operations like ris.next(Cu cu, int pl) and so on. It also allows modifications within the visible range, but not outside, where it throws an IOException.

Laziness for Inefficient Nestings

There are nestings which can be inefficient. For example, an AndSpace with a small intersection (smaller than the smallest input), but with a large smallest input, nested inside a large OrSpace can be slow, because each next() on the OrSpace will invoke next() on each input, throwing away all of the results but the nearest. By throwing away results, the OrSpace can cause a nested AndSpace to have to repeat slow loops as described above. Actually, the problem is more general than this, because it occurs whenever there is an OrSpace containing ItemSpaces that are slow to recompute.

One solution to this problem is to scan each AndSpace completely, storing the Items retrieved in one temporary subspace for each AndSpace, and then to use an OrSpace over the temporary spaces. The AndSpaces are computed very efficiently with the zig-zag algorithm, and the temporary space may be very small.

Another solution to this problem is Laziness.

Previous Next


Copyright © 1997-2006 Boiler Bay.