public synchronized static InfinityDB open(String fileName,
boolean isWriteable, long cacheSizeLimitBytes) throws IOException;
public synchronized static InfinityDB create(String fileName,
boolean overWrite, long cacheSizeLimitBytes) throws IOException {
The block cache is an in-memory set of 10K blocks of the BTree at various levels
in the tree. The topmost or root block is always there, plus usually the next
lower levels, and many blocks at the bottom or leaf level. When a next()
or insert() or delete() is invoked, the tree is logically searched from root to
leaf, using blocks that happen already to be in the cache. Blocks
are read into the cache as needed, and the least recently used blocks
are removed from the cache to make room. When random Items are inserted
into a an index significantly larger than the cache,
blocks at the bottom and near-bottom
level will usually need to be read in and if necessary
written out with each operation,
so the performance limit will be determined by
the seek performance of the drive, which is
about 110 operations/sec. (The seek time of disk
drives has changed very little over the last few
decades.) A sorted input to the same InfinityDB can be very fast,
because the Items are inserted into the single block at the
logical end of the set of index blocks, and there are
far fewer seeks. In fact, the final block simply stays in
the cache, being inserted into and splitting into new blocks
as necessary, with the new blocks created often being written
directly onto the end of the index file.
The speed of inserting Items into a block that is
already in the cache is roughly 120K Items/sec at 2.8Ghz.
For speed, do this:
DiskBasedMergeSorterItemOutput sorter = new DiskBasedMergeSorterItemOutput(
temporaryDirectoryFile, temporaryFileNamePrefix,
long maxTemporarySpaceBytes);
Cu cu = Cu.alloc();
while (there are input Items) {
// put Item in cu, then insert into the sorter
sorter.insert(cu); // 'write' the Item into the sorter
}
// now retrieve Items
ItemInput resultsItemInput = sorter.getResultsItemInput();
while (resultsItemInput.readItem(cu)) { // read the sorted Items back
// do something with Item in cu
..
}
sorter.close(); // or do resultsItemInput.close();
The sequence of results can simply be inserted into any ItemSpace as
they are read from the resultsItemSpace, and the performance will be excellent,
with no significant size limit. As the size increases, the performance
will decrease with the logarithm of the input size.
To get a more disk-space efficient sort:
// Use an InfinityDB or other large ItemSpace to store temporary data, // possibly a subspace of the target or other ItemSpace ItemSpace tempSpace = InfinityDB.create(temporaryFilePath, true); ItemSpaceBasedMergeSorterItemOutput sorter = new ItemSpaceBasedMergeSorterItemOutput( tempSpace, maxTemporarySpaceBytes); // and use as above for DiskBasedMergeSortItemOutput..
InfinityDB implementation (as of 9/29/2006),
the smallest batches are created in memory using a VolatileItemSpace.
Because of the memory efficiency of this ItemSpace, the initial
batches can be large, which reduces the likelyhood of needing extra
passes.
ItemPacket not an Object, but is a
simple uncompressed encoding of an Item
having a two-byte big-endian initial operation code followed by a
two-byte big-endian content length in bytes, followed by a
content part, which is usually a single Item. The Item is
a sequence of big-endian two-byte chars up to Cu.MAX_ITEM_LENGTH
chars long. The entire packet is up to ItemPacket.MAX_PACKET_LENGTH_BYTES
bytes long (which includes expansion space beyond that required for only one
Item).
The ItemPacket encoding is extremely fast and easy to parse, but it is not as space efficient as the encoding of Items in InfinityDB, which does dynamic compression. As a result, the temporary space on disk may be from one to ten times as big as the final InfinityDB database. This can be confusing, and it may seem that data is being lost or the sorter is not working correctly. It is important to remember that the goal is not to have an index of a certain size, but instead to store a given number of Items. Thus comparisons with other DBMS's should take into account the amount of source data, not the final index size. The same is true of performance: high source data rate is the goal. For example, the example program com.infinitydb.examples.InfinityDBSortingBulkLoad inserts 10M Items of 20 bytes each, which should take at least 200MB, but in fact the database takes roughly 22MB. The Items are randomized in a way that shows good compression, but there are other randomizers there to experiment with.
If a very large InfinityDB is required and there is not enough
temporary space, it can be built up iteratively with a
sequence of 'incremental' merge sorts, each result sorting Item sequence
simply being inserted into the growing InfinityDB. This can be
efficient up to a point, but the performance of each
iteration will be less and less because the time will be
determined mostly by the InfinityDB size. It may be possible
to avoid using an incremental merge sort until the end.
Also, the more disk-efficient
ItemSpaceMergeSorterItemOutput can
be used when temporary space is getting scarce. If the
input Items are not actually uniformly random, then
performance may still be good. Sometimes only a small
set of the blocks in the growing InfinityDB will be
touched by the new data, and there will be few seeks.
For example, a set of Items related to a single
Entity will usually go in one block, or an entirely
new Inversion will be created and it will
create a new set of adjacent Items.
(Inversion is just the optional InfinityDB
'Entity-Attribute-Value' class and concept corresponding to a
relational index, but it is more flexible.)
The temporary file name paths look like this:
temporaryDirectoryPath/temporaryFileNamePrefixInfinityDBSortMergeTemp_passNumber_batchNumberWhere temporaryDirectoryPath and temporaryFileNamePrefix are as given in the constructor. There is also a lock file:
temporaryDirectoryPath/temporaryFileNamePrefixInfinityDBSortMergeTemp.lockThe lock file is locked before each use with
java.nio.channels.FileLock,
in the tryLock mode by default. After each complete sort, all temporary files are
deleted. No files without 'InfinityDBSortMergeTemp' in their names are created or deleted.
It is also possible to create a custom temporary file manager
by extending DiskBasedMergeSorterTempFileManager and
providing it to the merge sorter constructor.
ItemSpaceBasedMergeSorterItemOutput. This sorter keeps all temporary
data in an ItemSpace, which would normally be an InfinityDB. Because
there is an optional temporarySpacePrefix in the constructor,
it is possible keep the temporary data inside the InfinityDB that is the
target of the sort. This will not only simplify management, but also will
return the temporary space inside the InfinityDB for future expansion (currently
InfinityDB files do not shrink dynamically). The temporarySpacePrefix
is prepended to all Items in the temporary ItemSpace. The temporary Items
currently have the structure:
temporarySpacePrefix passNumber batchNumber ItemThe temporary ItemSpace can of course be on disk in a separate InfinityDB, so that all temporary space will be retured to the file system when the sort is done. The temporary space required for the largest pass will be approximately equal to the final target Infinity size, unless an iterative approach is used as describe above for the final steps.