Byte and Character Array Components


Array Components

In order to include chunks of raw data in an Item, three 'array' component types are available. These are size-limited sequences of bytes or chars that map directly to Java byte[] or char[]. There is also a 'ByteString' component type.

Appending a byte[] or char[] is simple:

	char[] chars = new char[100];
	// fill the array, then choose the slice to append
	int start = 0;
	int length = chars.length;
	cu.clear().append(chars, start, length);

You can quickly transfer slices of a byte[] or char[] into and out of slices of a byte array or char array component, and you can get the component's length and so on. However, you cannot change the size of a component once it has been appended: you have to truncate away the existing component and re-append it to get a new size. The array component being modified does not have to be at the end of the Item currently in the Cu.

Byte arrays are stored in the Item as chars with the top byte zero, while char arrays map directly. Because Java chars are two-byte UTF-16 code units, Items holding byte arrays leave many zeroes in the top bytes of the Cu's internal char array. However, the upper zeroes disappear when the Item is stored in the db, because UTF-8 encoding is transparently applied to all persisted data.
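
The byte-to-char mapping can be illustrated with a minimal sketch in plain Java. It does not use the Cu API or model the database's actual storage format; the class ByteAsCharSketch and its methods are hypothetical names introduced only for illustration. It widens each byte to a char with a zero top byte and then measures the size of the chars after UTF-8 encoding.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not part of the Cu API.
final class ByteAsCharSketch {
    // Widen each byte to a char whose top byte is zero (values 0..255).
    static char[] widen(byte[] bytes, int offset, int length) {
        char[] chars = new char[length];
        for (int i = 0; i < length; i++) {
            chars[i] = (char) (bytes[offset + i] & 0xFF);
        }
        return chars;
    }

    // Narrow chars back to bytes, discarding the top byte.
    static byte[] narrow(char[] chars) {
        byte[] bytes = new byte[chars.length];
        for (int i = 0; i < chars.length; i++) {
            bytes[i] = (byte) chars[i];
        }
        return bytes;
    }

    // Number of bytes the chars occupy once encoded as UTF-8.
    static int utf8Length(char[] chars) {
        return new String(chars).getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        byte[] data = {0, 1, 65, 127};
        char[] stored = widen(data, 0, data.length);
        System.out.println("in-memory bytes: " + stored.length * 2);
        System.out.println("UTF-8 bytes: " + utf8Length(stored));
    }
}
```

In this sketch, byte values 0 to 127 each take one byte after UTF-8 encoding, while values 128 to 255 take two, so the saving from UTF-8 alone depends on the data.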

The sorting semantics of the array components are not like those of Strings or of Items themselves, in that the length of the component is more significant than its content. In other words, two arrays with different lengths always sort with the longer one later, while array components of identical length sort according to the contained bytes or chars, with earlier bytes or chars most significant.
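
The length-first ordering described above can be sketched as an ordinary Java Comparator. This is only an illustration of the ordering rule, not the database's implementation; the class and field names are hypothetical.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical illustration of length-first ordering; not part of the Cu API.
final class LengthFirstOrderSketch {
    static final Comparator<char[]> LENGTH_FIRST = (a, b) -> {
        // A shorter array always sorts before a longer one.
        if (a.length != b.length) {
            return Integer.compare(a.length, b.length);
        }
        // Equal lengths: the earliest differing char decides.
        for (int i = 0; i < a.length; i++) {
            if (a[i] != b[i]) {
                return Character.compare(a[i], b[i]);
            }
        }
        return 0;
    };

    public static void main(String[] args) {
        char[][] arrays = {
            "b".toCharArray(), "aa".toCharArray(),
            "ab".toCharArray(), "z".toCharArray()
        };
        Arrays.sort(arrays, LENGTH_FIRST);
        for (char[] a : arrays) {
            System.out.println(new String(a)); // prints b, z, aa, ab
        }
    }
}
```

Note that "z" sorts before "aa" here, which is the opposite of String ordering.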

The length of char[] and byte[] components is limited by the maximum Item length of a bit more than 1600 chars. Longer arrays can effectively be constructed using Index Components, and long byte and char streams can use CharacterLongObjects and BinaryLongObjects. The latter use index components plus char[] and byte[] components internally.
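
The idea of building a long array from size-limited pieces can be sketched in plain Java as follows. This is not how CharacterLongObjects or BinaryLongObjects are implemented; the class name, the chunk size of 1024, and the use of a List position in place of an index component are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of chunking; not part of the database API.
final class ChunkedArraySketch {
    // Assumed chunk size, chosen to stay under the Item length limit.
    static final int CHUNK_SIZE = 1024;

    // Split data into chunks; a chunk's list position stands in for its index.
    static List<byte[]> split(byte[] data, int chunkSize) {
        List<byte[]> chunks = new ArrayList<>();
        for (int start = 0; start < data.length; start += chunkSize) {
            int end = Math.min(start + chunkSize, data.length);
            chunks.add(Arrays.copyOfRange(data, start, end));
        }
        return chunks;
    }

    // Reassemble the chunks in index order.
    static byte[] join(List<byte[]> chunks) {
        int total = 0;
        for (byte[] chunk : chunks) {
            total += chunk.length;
        }
        byte[] out = new byte[total];
        int pos = 0;
        for (byte[] chunk : chunks) {
            System.arraycopy(chunk, 0, out, pos, chunk.length);
            pos += chunk.length;
        }
        return out;
    }
}
```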

ByteString Components

In order to get maximum efficiency for raw data in an Item and to have String-like sorting behavior, a ByteString component can be used. The bytes of a ByteString are packed into the chars of the Item, and are encoded so that the array length is less significant than its content. The efficiency is about 86% for the data itself, in relatively long arrays: it cannot reach 100% because termination information is distributed inside the byte string component. The higher efficiency due to byte packing maximizes cache efficiency, but has no effect on persisted data, because persisted data is already compressed. Unfortunately, it is not possible to transfer slices of external arrays into or out of slices of a byte string component inside an Item, as it is with byte[] and char[] components: only the entire component can be transferred. Such a transfer is relatively fast, but not as fast as for byte[] or char[] components.
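
The String-like, content-first ordering can be sketched as a plain Java comparison. This shows only the ordering rule, not the ByteString encoding itself; the class name is hypothetical, and treating bytes as unsigned and sorting a prefix before its extensions are assumptions.

```java
// Hypothetical illustration of content-first ordering; not the ByteString encoding.
final class ContentFirstOrderSketch {
    static int compare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        // The earliest differing byte decides, regardless of length.
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xFF; // assumption: bytes compare as unsigned
            int y = b[i] & 0xFF;
            if (x != y) {
                return Integer.compare(x, y);
            }
        }
        // One array is a prefix of the other: the shorter sorts first.
        return Integer.compare(a.length, b.length);
    }
}
```

Under this rule the three-byte array {1, 2, 3} sorts before the one-byte array {2}, whereas a length-first ordering would place {2} first.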

There is a ByteString class, but it is only used as a convenience for holding a reference to an internal byte[] with offset and length. When a ByteString is appended, a byte string component is automatically created, and Cu.byteStringAt(int offset) returns a new ByteString.

Raw or Unformatted Items

It is also possible to completely bypass the normal encoding of Items as sequences of components. Items can be created having any content at all, using Cu.setCharAt(int offset, char c) and various Cu.setDirect(..) and Cu.getDirect(..) methods. These methods can use System.arraycopy() for chars, or tight loops for bytes, which are packed into the upper and lower bytes of the char[] inside the Item. For performance-critical applications that still want flexible structuring for other purposes, it is sometimes reasonable to use a prefix of component-formatted data followed by raw data. The cache efficiency and speed can be maximized, and the transparent ZLib/UTF-8 compression of persisted data is still used.
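
Packing bytes into the upper and lower bytes of a char[] can be sketched with a tight loop in plain Java. This is not the code behind Cu.setDirect(..) or Cu.getDirect(..); the class name is hypothetical, and placing the first byte of each pair in the upper half of the char is an assumption.

```java
// Hypothetical illustration of two-bytes-per-char packing; not the Cu implementation.
final class BytePackingSketch {
    // Pack two bytes per char, first byte in the upper half (assumption).
    static char[] pack(byte[] bytes) {
        char[] chars = new char[(bytes.length + 1) / 2];
        for (int i = 0; i < bytes.length; i++) {
            int b = bytes[i] & 0xFF;
            if ((i & 1) == 0) {
                chars[i / 2] = (char) (b << 8);
            } else {
                chars[i / 2] |= (char) b;
            }
        }
        return chars;
    }

    // Unpack byteCount bytes; the count is needed because an odd-length
    // input leaves the final lower byte unused.
    static byte[] unpack(char[] chars, int byteCount) {
        byte[] bytes = new byte[byteCount];
        for (int i = 0; i < byteCount; i++) {
            char c = chars[i / 2];
            bytes[i] = (byte) ((i & 1) == 0 ? c >>> 8 : c);
        }
        return bytes;
    }
}
```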

Strings to and from Arrays

String components do not always need to be appended onto and extracted from a Cu as String Objects. They can instead be handled without constructing any Objects, by transferring them directly using char[]s or other Cus. A Cu containing the raw text corresponding to a String - a Cu containing plaintext - can be appended as a string component using Cu.appendStringFromPlainText(Cu cu). Likewise, a char[] can be appended without construction using Cu.appendStringFromPlainText(char[] buf, int offset, int length).


Copyright © 1997-2006 Boiler Bay.