June 1, 2006

Prefix compression - It affects your world view

David X wrote:

> Dear Support,
>  
> Suppose I enter a number of people in the database as below:
>  
> // persons[]:  array of person names to add to the 
< database (n == numberOfPersonsInArray)
> for (i=0; i<n; i++) {
>     cu = cu.clear().append(ENTITY_TYPE).append("person").append(person[i]);
>     db.insert();
> }
>  
> Keeping in mind that .append(ENTITY_TYPE).append("person") are always 
> constant, I would expect the tree to look like the following:
>  
> //                     "ENTITY_TYPE"
> //                           |
> //                         "person"
> //                           |
> //              +------------+--------------+
> //              |            |              |
> //           person[0]    person[1]  ... person[n-1]
>  
> However, "ENTITY_TYPE" and "person" are also stored in the database 
> "n" times, not once like I expected.
>  
> I would like to know how to think about "ENTITY_TYPE" and 
"person", which I look at as singular entries, but are not.  
I assume this has to do with each append() putting a new pointer into the database.
>  
> I am not having any problems with this, because it all works fine; 
but I am wanting to build a hierarchy from this information, and 
the multiple entries for each "ENTITY_TYPE" and "person" surprised 
me (the hierarchy can still be built by ignoring duplicate entries).
>  
>  
> David
>  

Yes, thought about as a completely independent series of Items, it does look
like there is duplication of the EntityClass and Entity when  you  use next() to
scan over a subtree or 'subspace'.

I think you have to remember the way the next() and other operations work
with a protected prefix length. That protected prefix length will always cover the part of
the Items that are to be considered 'higher' in the tree the way you drew it,
so that higher part of the data does not change in the Cu. When you use next(), you are
always only looking at the changes in the Cu beyond that prefix length, and you
ignore the common prefix. So the fact that the prefix is still sitting in there unchanged
really doesn't matter practically.

>From the perspective of any ItemSpace, Items are actually only simple variable-length
char sequences kept in order, but they are basically
always stored in the database with prefixes shared between adjacent Items
compressed out at the character level. This compression is transparent, though, so
there is no way any user of an ItemSpace can know that prefix
compression is going on. However, it would just be so ineffiicient to store
the common prefixes that you get used to assuming that that
prefix compression is going on, and the apparent duplication of the prefixes
when you just list out the Items in sequence doesn't bother you after a while.
The ItemEditor shows the apparent duplications as well, and you just have to look
past that.

So there actually aren't any special internal pointers or other tricky stuff
going on. An ItemSpace is just an ordered sequence of Items, each Item  is
a variable-length sequence of up to 1600 chars [which is not a significant
limit due to the ability always to structure data in ways that avoids 
the problem - ed]. The higher-level meaning
of Items as a series of components is actually not something any ItemSpace knows
about. That's why it's so easy to invent new ItemSpaces, like the DeltaItemSpace
or the VolatileItemSpace in the latest release. The concept of components is
actually totally restricted to the append(), skipXX(), and other operations on Cu.
(In fact you aren't even restricted to use those component-oriented
operations: you could invent your own meaning of the chars in the
Item by using raw operations like Cu.appendChar(c) and so on.)

One way you would be concerned about the apparent duplication of initial prefixes would be
if you wrote a simple database dumper that scanned all the Items in the db using
db.next(cu,0) and then wrote each Item out as a raw sequence of chars. There
would be no compression of shared adjacent prefixes, and the output would be
rather inefficient. It would be easy to fix that dumper to output only the
suffixes that actually change from Item to Item. That would be as efficient as
the InfinityDB internal compression.

I agree with you now that I think about it that that apparent duplication of
prefixes is a bit disconcerting at first. But the whole Idea of an EAV-structured
database, or even most of the other ItemSpace-based structures you might use, depend on
the internal compression inside the ItemSpace implementation for efficiency.

Good point.

Roger

July 6, 2005

InfinityDB's single file architecture and logging

This question is about logging—our’s vs. that of Berkelely DB. The potential customer is intrigued to hear about our single-file architecture and wants to know how we accomplish logging and how that compares with Berkeley DB.

Our reply:

InfinityDB and BerkeleyDB JE use completely different file storage systems, with a dramatic difference in storage efficiency and performance.

‘JE’ keeps all information in a changing series of numbered ‘log’ files,which actually contain all data as well as transaction logs. All data is appended to the currentlog file,which means the files continue to grow as database updates progress. There is no limit to the size of these files except that occasionally a’cleaner’thread condenses the older log files into the current one by copying everything that has not been freed. The result is that every database modification creates ‘garbage’ which takes up space in older log files, while conversely, every piece of data that is not modified must eventually be ‘copied forwards’ on each cleanup cycle. There is therefore a tradeoff between space and time efficiency which does not necessarily have an acceptable optimum. Note also that normally a special directory must be used in which the files live and change, making it difficult to use a database in many applications.

InfinityDB keeps all information in one file with a static name. The free space is located inside the file with the data, on a free list, for instant allocation. Data modifications are paged in and out of the file as needed from the cache, and are written over existing free space, leaving commited data untouched. A commit of the single global transaction flushes the remaining dirty pages from the cache and writes a single header block that locks in the changes to the data and free list. Any system or application failure will leave commited data and the free list in the original state, so no log is needed. There is no size limit or performance penalty forlarge transactions, which simply use up the free space and then append to the file as needed. There is no need to copy unchanged data, and there is no garbage to collect, as freed blocks go immediately on the transactional free list. The worst-case temporary free space usage occurs when a transaction modifies every block in the database before the commit, in which case the space efficiency drops to 50% until further inserts bring it back up again. However it is important to remember that this worst-case 50% efficiency applies to already-compressed data, so the effective efficiency with respect to the original uncompressed data can be much higher.

The storage efficiency of InfinityDB is unparalled. With a typical pattern of inserts and occasional commits, the free block space in the file can be kept insignificantly small – only a few percent. This depends on doing commits reasonably frequently. If there are more deletions than inserts, free space increases, but is always reused when needed again. BTree blocks are actually stored variable-length, therefore can be compressed using UTF-8and then ZLib before being written, leaving virtually no internal fragmentation and extremely dense data. It is important to realize that this free space ‘surrounds’ already compressed data, so as a fraction of the original uncompressed data, it is effectively even smaller. In one test, uncompressable Secure Hash Algorithm data was stored with exactly the minimum theoretical space. In other tests, 28% free space was observed in a database storing worst-case uncompressable data.