Bulk Loading and Sorting

Previous Next

The Need for Bulk Loading

When creating large databases, simply inserting random Items one-by-one into an InfinityDB index is impractically slow, and the bulk loading system should be used. Bulk loading uses a 'merge sorter' which can be given any unsorted sequence of Items, sort them, and return them in sorted order for any purpose. Typically the sorted output is inserted directly into an InfinityDB, but since the Items are no longer random, the insertions are fast. The performance for building an index this way with v1.0.53 10/9/2006 is approximately 66K Items/second at 2.8GHz with 20 Byte Items using the high-speed merge sorter, or half that for the low-space merge sorter. Any size index can be created. Look in the examples directory for demonstrations of bulk loading and sorting. The maximum practical index size for purely random keys without sorting is the block cache size as specified in the open and create methods of the InfinityDB class:
    public synchronized static InfinityDB open(String fileName,
            boolean isWriteable, long cacheSizeLimitBytes) throws IOException;
    public synchronized static InfinityDB create(String fileName,
            boolean overWrite, long cacheSizeLimitBytes) throws IOException {
The block cache is an in-memory set of 10K blocks of the BTree at various levels in the tree. The topmost or root block is always there, plus usually the next lower levels, and many blocks at the bottom or leaf level. When a next() or insert() or delete() is invoked, the tree is logically searched from root to leaf, using blocks that happen already to be in the cache. Blocks are read into the cache as needed, and the least recently used blocks are removed from the cache to make room. When random Items are inserted into a an index significantly larger than the cache, blocks at the bottom and near-bottom level will usually need to be read in and if necessary written out with each operation, so the performance limit will be determined by the seek performance of the drive, which is about 110 operations/sec. (The seek time of disk drives has changed very little over the last few decades.) A sorted input to the same InfinityDB can be very fast, because the Items are inserted into the single block at the logical end of the set of index blocks, and there are far fewer seeks. In fact, the final block simply stays in the cache, being inserted into and splitting into new blocks as necessary, with the new blocks created often being written directly onto the end of the index file. The speed of inserting Items into a block that is already in the cache is roughly 120K Items/sec at 2.8Ghz.

Using the Merge Sorter

To sort quickly quantities of Items that will not fit into available memory, temporary disk and memory space is required. There are two merge sorters, one for high speed but less disk space efficiency, and one for better disk space efficiency but lower speed. Both run faster with more temporary memory space.

For speed, do this:

	DiskBasedMergeSorterItemOutput sorter = new DiskBasedMergeSorterItemOutput(
		temporaryDirectoryFile, temporaryFileNamePrefix,
		long maxTemporarySpaceBytes);
	Cu cu = Cu.alloc();
	while (there are input Items) {
		// put Item in cu, then insert into the sorter
		sorter.insert(cu); // 'write' the Item into the sorter
	}
	// now retrieve Items
	ItemInput resultsItemInput = sorter.getResultsItemInput();
	while (resultsItemInput.readItem(cu)) { // read the sorted Items back
		// do something with Item in cu
		..
	}
	sorter.close(); // or do resultsItemInput.close();
The sequence of results can simply be inserted into any ItemSpace as they are read from the resultsItemSpace, and the performance will be excellent, with no significant size limit. As the size increases, the performance will decrease with the logarithm of the input size.

To get a more disk-space efficient sort:

	// Use an InfinityDB or other large ItemSpace to store temporary data,
	// possibly a subspace of the target or other ItemSpace
	ItemSpace tempSpace = InfinityDB.create(temporaryFilePath, true);
	ItemSpaceBasedMergeSorterItemOutput sorter = new ItemSpaceBasedMergeSorterItemOutput(
		tempSpace, maxTemporarySpaceBytes);
	// and use as above for DiskBasedMergeSortItemOutput..

Merge Sorting Overview

In general, typical merge sorters store temporary data on disk in a sequence of passes, each pass containing a set of two to 100 or more batches or runs (in Lucene, a popular open-source text content indexer, batches are are in individual files called 'segments'). The batches on disk contain sorted sequences of keys in some form, such as individual files or contiguous or scattered segments of a file. The amount of data or the key count in each pass increases geometrically with increasing pass number, with a multiple of two to 10 or more between passes, and each pass always consists of a set of one to roughly 10 or more batches. In many sorters, this pattern is maintained continously, by 'merging' batches as necessary while keys come in to the sorter. (Other mergers keep only a single pair of adjacent passes on disk at a time, but this increases the temporary space.) The smallest batches are created from the input data using in-memory sorters. When the number or size of the batches in a given pass exceeds a threshold, all of the batches in that pass are merged together to create one batch at the next higher pass. A merge operation is simple in principle: all of the input batches from a pass are scanned together in sorted order, and at any time during the merge, the next output is the smallest key available from any of the input batches.

The InfinityDB In-Memory Sorter

In the current InfinityDB implementation (as of 9/29/2006), the smallest batches are created in memory using a VolatileItemSpace. Because of the memory efficiency of this ItemSpace, the initial batches can be large, which reduces the likelyhood of needing extra passes.

The High-Speed Merge Sorter

The high-speed sorter currently stores each batch as a separate file containing sequential Items encoded as ItemPackets. An ItemPacket not an Object, but is a simple uncompressed encoding of an Item having a two-byte big-endian initial operation code followed by a two-byte big-endian content length in bytes, followed by a content part, which is usually a single Item. The Item is a sequence of big-endian two-byte chars up to Cu.MAX_ITEM_LENGTH chars long. The entire packet is up to ItemPacket.MAX_PACKET_LENGTH_BYTES bytes long (which includes expansion space beyond that required for only one Item).

The ItemPacket encoding is extremely fast and easy to parse, but it is not as space efficient as the encoding of Items in InfinityDB, which does dynamic compression. As a result, the temporary space on disk may be from one to ten times as big as the final InfinityDB database. This can be confusing, and it may seem that data is being lost or the sorter is not working correctly. It is important to remember that the goal is not to have an index of a certain size, but instead to store a given number of Items. Thus comparisons with other DBMS's should take into account the amount of source data, not the final index size. The same is true of performance: high source data rate is the goal. For example, the example program com.infinitydb.examples.InfinityDBSortingBulkLoad inserts 10M Items of 20 bytes each, which should take at least 200MB, but in fact the database takes roughly 22MB. The Items are randomized in a way that shows good compression, but there are other randomizers there to experiment with.

If a very large InfinityDB is required and there is not enough temporary space, it can be built up iteratively with a sequence of 'incremental' merge sorts, each result sorting Item sequence simply being inserted into the growing InfinityDB. This can be efficient up to a point, but the performance of each iteration will be less and less because the time will be determined mostly by the InfinityDB size. It may be possible to avoid using an incremental merge sort until the end. Also, the more disk-efficient ItemSpaceMergeSorterItemOutput can be used when temporary space is getting scarce. If the input Items are not actually uniformly random, then performance may still be good. Sometimes only a small set of the blocks in the growing InfinityDB will be touched by the new data, and there will be few seeks. For example, a set of Items related to a single Entity will usually go in one block, or an entirely new Inversion will be created and it will create a new set of adjacent Items. (Inversion is just the optional InfinityDB 'Entity-Attribute-Value' class and concept corresponding to a relational index, but it is more flexible.)

The temporary file name paths look like this:

	temporaryDirectoryPath/temporaryFileNamePrefixInfinityDBSortMergeTemp_passNumber_batchNumber
Where temporaryDirectoryPath and temporaryFileNamePrefix are as given in the constructor. There is also a lock file:
	temporaryDirectoryPath/temporaryFileNamePrefixInfinityDBSortMergeTemp.lock
The lock file is locked before each use with java.nio.channels.FileLock, in the tryLock mode by default. After each complete sort, all temporary files are deleted. No files without 'InfinityDBSortMergeTemp' in their names are created or deleted.

It is also possible to create a custom temporary file manager by extending DiskBasedMergeSorterTempFileManager and providing it to the merge sorter constructor.

The High-Density Merge Sorter

If disk space is limited or it is desired to avoid allocating and keeping track of temporary files and directories, it may be better to use the slower ItemSpaceBasedMergeSorterItemOutput. This sorter keeps all temporary data in an ItemSpace, which would normally be an InfinityDB. Because there is an optional temporarySpacePrefix in the constructor, it is possible keep the temporary data inside the InfinityDB that is the target of the sort. This will not only simplify management, but also will return the temporary space inside the InfinityDB for future expansion (currently InfinityDB files do not shrink dynamically). The temporarySpacePrefix is prepended to all Items in the temporary ItemSpace. The temporary Items currently have the structure:
temporarySpacePrefix passNumber batchNumber Item
The temporary ItemSpace can of course be on disk in a separate InfinityDB, so that all temporary space will be retured to the file system when the sort is done. The temporary space required for the largest pass will be approximately equal to the final target Infinity size, unless an iterative approach is used as describe above for the final steps.

Previous Next


Copyright © 1997-2006 Boiler Bay.