CharacterLongObjects and BinaryLongObjects

Basic Long Object Usage

A BinaryLongObject is accessed through an InputStream or OutputStream and is persisted inside the db, much like an OS file. A CharacterLongObject uses a Reader or Writer instead. Each Long Object is identified by any desired unique Item prefix, and its data is stored as a sequence of Items sharing that prefix. Many RDBMSs also have Long Objects, or LOBs. A Long Object can be used as the Value part of an EAV structure, if the EAV model is used, or on any other Item prefix.

To store a CharacterLongObject, the code is very straightforward:

    Cu cu = Cu.alloc().append("my prefix");
    CharacterLongObjectWriter w = new CharacterLongObjectWriter(db, cu);
    w.setKeepResourcesOnClose(true); // Optional, for reusability
    Random random = new Random(1234); // Fixed seed for repeatable data
    for (int i = 0; i < 1000000; i++) {
        w.write('A' + random.nextInt(26)); // nextInt(26) is never negative, unlike nextInt() % 26
    }
    w.close(); // Flushes. Internal buffer and Cu preserved for speed

To read it back again the code is also straightforward:

    CharacterLongObjectReader r = new CharacterLongObjectReader(db, cu);
    r.setKeepResourcesOnClose(true); // Optional, for reusability
    for (int c; (c = r.read()) != -1;) {
        // use c
    }
    r.close(); // Internal Cu preserved for speed

Storage of BinaryLongObjects is essentially the same.

Space and Time Efficiency

The example generates random ASCII chars to emphasize that many types of data are compressed inside InfinityDB. The space used by such a 'file' can be much less than that used by the OS for normal files because of the ZLib Huffman coding, LZW common substring factoring, and UTF-8 compression, combined with variable-length blocks and prefix compression, which avoids redundant storage of the shared prefix. The given CLOB takes 752KB, while a file written using a UTF-8 character encoding would take 1MB. The ratio improves further with real text that contains repeating words or other substrings. Furthermore, there is no directory overhead or wasted space at the end of the last block, so short LOBs take very little space. A LOB with zero characters or bytes takes no space.
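
The effect can be approximated outside the database. The following is a minimal sketch, assuming the JDK's java.util.zip.Deflater as a stand-in compressor; it is not InfinityDB's internal compression, the class and method names are made up for illustration, and the sizes it prints will not match the 752KB figure above. It only shows that random letters drawn from a 26-symbol alphabet carry fewer than 8 bits of information per char, so a general-purpose compressor shrinks them noticeably.

```java
import java.nio.charset.StandardCharsets;
import java.util.Random;
import java.util.zip.Deflater;

// Hypothetical illustration class; not part of the InfinityDB API.
final class RandomTextCompressionSketch {

    /** Builds the same kind of data as the example: random letters 'A'..'Z'. */
    static byte[] randomLetters(int count, long seed) {
        Random random = new Random(seed);
        StringBuilder sb = new StringBuilder(count);
        for (int i = 0; i < count; i++) {
            sb.append((char) ('A' + random.nextInt(26)));
        }
        // Pure ASCII, so UTF-8 uses exactly one byte per char.
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    /** Returns the DEFLATE-compressed size of the input, in bytes. */
    static int deflatedSize(byte[] input) {
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[8192];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buffer); // count bytes, discard the output
        }
        deflater.end();
        return total;
    }

    public static void main(String[] args) {
        byte[] utf8 = randomLetters(1_000_000, 1234L);
        System.out.println("UTF-8 bytes:    " + utf8.length);
        System.out.println("Deflated bytes: " + deflatedSize(utf8));
    }
}
```

Real text, with repeated words and substrings, would normally compress further than this random sample.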

Note also that LongObject Writers, Readers, OutputStreams, and InputStreams are very lightweight, and can be used again and again without releasing internal resources. Use setSpaceAndPrefix(ItemSpace space, Cu cuPrefix) to start over again after closing, assuming setKeepResourcesOnClose(boolean keepResourcesOnClose) has been set true.

Storage Format

LOBs are stored as a set of Items with a common prefix, followed by an index component and terminated by a char array component or a byte array component. Each char or byte array component holds 1024 chars or bytes, so the index component's numeric value divides the LOB into 1K blocks. The Items print using Cu.toString() like:

    "My BLOB's unique prefix" [997] {21 37 0.... 0 61}

Here the component in brackets is the index component identifying a block, and the single component in braces is the data 'block' as a byte array component. The last block may be shorter than 1024 bytes or chars. For a CLOB, the Item looks like:

    "My CLOB's unique prefix" [997] {" ... "}
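
The block layout described above can be modeled in plain Java. This is only a sketch under stated assumptions: a TreeMap stands in for the sorted Items under one prefix, the block index is assumed to start at 0, and the class and method names are hypothetical rather than part of InfinityDB.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical model of the layout; not InfinityDB code.
final class LobBlockLayoutSketch {

    static final int BLOCK_SIZE = 1024;

    /** Splits data into blocks keyed by block index (assumed to start at 0). */
    static SortedMap<Long, byte[]> split(byte[] data) {
        SortedMap<Long, byte[]> blocks = new TreeMap<>();
        for (int offset = 0; offset < data.length; offset += BLOCK_SIZE) {
            int end = Math.min(offset + BLOCK_SIZE, data.length);
            blocks.put((long) (offset / BLOCK_SIZE), Arrays.copyOfRange(data, offset, end));
        }
        return blocks; // empty input produces no entries at all
    }

    /** Reassembles the original bytes by visiting blocks in index order. */
    static byte[] join(SortedMap<Long, byte[]> blocks) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (byte[] block : blocks.values()) {
            out.write(block, 0, block.length);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        SortedMap<Long, byte[]> blocks = split(new byte[2500]);
        for (Map.Entry<Long, byte[]> e : blocks.entrySet()) {
            // Prints [0] 1024 bytes, [1] 1024 bytes, [2] 452 bytes
            System.out.println("[" + e.getKey() + "] " + e.getValue().length + " bytes");
        }
    }
}
```

Because the keys sort by index, reading the LOB back is a single ordered scan over the entries under the prefix, and an empty LOB has no entries to store.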

There is no character encoding issue with InfinityDB CLOBs, because they are stored in native Java char format, two bytes per char, rather than being encoded into single bytes by a platform- or encoding-specific character set, as they are by an OutputStreamWriter. Thus there is only one storage format. The inefficiency of the two-byte char is overcome transparently by the underlying universal InfinityDB compression.
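
The difference between charset-dependent encoding and fixed two-byte chars can be seen with the JDK alone. The sketch below is illustrative only: the class and helper names are made up, and the big-endian byte order used for the raw chars is an assumption of this example, not a statement about InfinityDB's on-disk format.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; not part of the InfinityDB API.
final class CharStorageSketch {

    /** Stores each Java char as exactly two bytes; no charset is involved. */
    static byte[] toRawChars(String s) {
        ByteBuffer buf = ByteBuffer.allocate(s.length() * 2);
        for (int i = 0; i < s.length(); i++) {
            buf.putChar(s.charAt(i));
        }
        return buf.array();
    }

    /** Reads the two-byte chars back; always the exact inverse of toRawChars. */
    static String fromRawChars(byte[] bytes) {
        return ByteBuffer.wrap(bytes).asCharBuffer().toString();
    }

    public static void main(String[] args) {
        String text = "A\u00e9\u20acZ"; // 'A', e-acute, euro sign, 'Z'

        // Encoded forms depend on the charset chosen by the writer.
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        String latin1Back = new String(latin1, StandardCharsets.ISO_8859_1);

        System.out.println("UTF-8 length:         " + utf8.length);
        System.out.println("ISO-8859-1 lossless?  " + latin1Back.equals(text));
        System.out.println("Raw char length:      " + toRawChars(text).length);
        System.out.println("Raw chars lossless?   " + fromRawChars(toRawChars(text)).equals(text));
    }
}
```

With raw chars the reader never needs to know which encoding the writer chose, which is the property the paragraph above relies on.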


Copyright © 1997-2006 Boiler Bay.