OpenIsis deploys a B-Tree structure similar to that of CDS/ISIS.
There are several differences to the original CDS/ISIS file layout:
- The complete index is kept in one file, including control information (.CNT), inner nodes (.N0x) and leaf nodes (.L0x) of any key length, and "inverted file pointers" (.IFP, see below).
- The maximum key length is configurable up to 255 bytes (CDS/ISIS: 30).
- The maximum mfn is configurable up to 6 bytes (CDS/ISIS: 3).
- The keys are not blank padded.
- The blocks have no fixed spread factor (CDS/ISIS: 10), but a fixed size of mK (where m is 1, 2, 4 or 8), allowing for about 200 keys of 30 bytes.
- A custom string comparison function may be used, allowing for proper unicode collation.
- To support safe, highly concurrent access, it is a B-Link tree as used in Postgres (see Lehman/Yao 1981), i.e. each block has a link to its right sibling.

Advantages:
- The number of necessary I/O operations is dramatically reduced, typically to 1. Since the invention of B-Trees by Bayer, the ratio of available main memory to disk storage has stayed at roughly 1:100. Since the total size of the inner nodes is approximately the total index size divided by the spread factor, it is crucial that the spread factor is at least 100, allowing all inner nodes to be kept in RAM.
- The price paid for this is that, since the inner nodes locate the wanted leaf less precisely (within a block of up to 200 entries instead of 10), more unwanted data is read from disk. However, all I/O is done in mK blocks on mK boundaries, matching the "page size" of most modern operating systems. This is the basic idea of B-Trees in the first place.
- The maximum database size is raised considerably with regard to the mfn. This allows for an index spanning several databases, where the higher mfn bytes indicate the relevant master file.

Limits:
The theoretical maximum number of terms is also somewhat raised, from 20G (2G leaf blocks with 10 terms each) to about 400G (4G blocks with up to 200 entries each). Without support for large files (> 2G), however, the actual limit of the one-file design may be even lower, since the 2G file size limits the total size of terms + IFPs.

* Basic operation:

The B-Link tree relates keys to "inverted file pointers" (IFPs). The key is a byte sequence with a length between 0 and 255. The IFPs are 8 to 11 bytes long, with 3 to 6 bytes for an mfn (rowid) and 5 bytes describing where the key occurs within the given record: 2 bytes field tag, 1 byte occ (repetition of the field), 2 bytes word position. These numbers are stored in big-endian byte order for easy sorting. The mfn/value length is fixed for a given B-Tree.

File structure:
The index file consists of consecutive mK blocks, with block number n at file offset n*mK. Block 0 is the root, block 1 the leftmost leaf.

Block structure:
The inner structure of a block is similar to an ISIS record: some header fields, followed by a dictionary of entries, which is an array of positions (offsets into the block) and lengths. The dictionary is sorted according to the entries' keys. The actual entry data is located at the end of the block, growing towards the dictionary as entries are added.

Entry structure:
Each entry starts with len bytes of key, as stated in the dictionary. For an inner node, it is optionally followed by an IFP, as indicated by a dictionary flag, and then by a 4 byte block number. For a leaf node, it is followed by a sorted array of IFPs.

Searching:
Since it is not uncommon for a key to relate to more distinct IFPs than would fit within one block, we actually consider the pair of key and IFP as the key when searching the tree. When looking for all IFPs of a key, we start with an empty (nulled) IFP. Upon modification (insert/delete), we use the given IFP as second key segment. Where the IFPs for one key span several leaf nodes, the separating inner node entry will have its key augmented with the IFP, so we can quickly locate the proper leaf node.
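To make these layouts concrete, here is a minimal sketch in C. All names are illustrative assumptions, not the actual OpenIsis source; it shows the smallest configuration (3 byte mfn, thus 8 byte IFPs) and uses plain memcmp where a custom collation function might be configured:
$
#include <stdint.h>
#include <string.h>

/* illustrative layout, assuming the smallest configuration */
enum { MFN_LEN = 3, IFP_LEN = MFN_LEN + 5 };

typedef struct DictEntry { /* one slot of the sorted dictionary */
	uint16_t pos;          /* offset of the entry data in the block */
	uint16_t len;          /* key length (plus flags on inner nodes) */
} DictEntry;

typedef struct Block {     /* header fields, then the dictionary */
	uint32_t  right;       /* block number of the right sibling */
	uint16_t  used;        /* number of dictionary entries */
	uint16_t  leaf;        /* nonzero for leaf nodes */
	DictEntry dict[1];     /* sorted by key; the entry data sits at
	                          the block end, growing towards here */
} Block;

/* pack an IFP big-endian, so that plain memcmp sorts correctly */
void ifp_pack( unsigned char *buf,
	uint32_t mfn, unsigned tag, unsigned occ, unsigned wpos )
{
	buf[0] = mfn >> 16; buf[1] = mfn >> 8; buf[2] = mfn; /* 3 byte mfn */
	buf[3] = tag >> 8;  buf[4] = tag;  /* 2 bytes field tag */
	buf[5] = occ;                      /* 1 byte occurrence */
	buf[6] = wpos >> 8; buf[7] = wpos; /* 2 bytes word position */
}

/* compare two (key,IFP) pairs: the key bytes first, then the IFP
 * as second key segment; a nulled IFP sorts before all others
 */
int entry_cmp(
	const unsigned char *ka, int la, const unsigned char *ia,
	const unsigned char *kb, int lb, const unsigned char *ib )
{
	int min = la < lb ? la : lb;
	int c = memcmp( ka, kb, min );
	if ( c ) return c;
	if ( la != lb ) return la - lb; /* shorter key sorts first */
	return memcmp( ia, ib, IFP_LEN );
}
$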
Deleting:
The entry is searched and deleted; there is no restructuring. (Background cleanup may be implemented some day.)

Inserting:
If a new entry fits within the target leaf, everything is fine. Otherwise either the new entry or one or more other entries may have to be shifted to the right sibling node. If the right sibling is non-existent or too full (or we didn't yet implement checking it), a new right sibling is created. The new separating key, which is lower than the one we stopped at while locating our node, is then updated or inserted in the parent.

Multi-Process Concurrency:
There *must not* be more than one process having the index open for writing. Readers, however, should work correctly even in the presence of a writer.

Multi-Thread Concurrency:
Within a multi-threaded process, all access to internal structures is interlocked within a monitor. The monitor is released during I/O, with the block marked as being read from or written to disk. A thread wishing to access a block being read, or wishing to write to a block while it is being written to disk, must wait on the monitor.

Lehman/Yao Concurrency:
While this design makes sure that each block by itself is always in a consistent state, the interrelation between blocks may be disturbed by node splitting or shifting: a key may be found to be no longer in the expected block. Thus, while the search key is greater than the greatest key in the current block, one has to try the right sibling.

* Concurrency Details:

As stated above, all access to internal structures is interlocked within a monitor, which is released during I/O. This design gives the best possible overall utilization on a single-CPU system: a given thread runs exclusively while it can utilize the CPU, i.e. until it waits for I/O (swapping shouldn't be an issue if you're seriously running a db server). If you have a multi-CPU system for nothing but OpenIsis, run secondary read-only servers, each bound to one CPU. This is by far more efficient than sharing multiple CPUs among one process's threads.

Actually, we deploy one mutex (LOCK/UNLOCK) and one condition (WAIT/WAKE) bound to it. Optionally, a set of conditions bound to the same mutex may be used. All access to btree blocks goes through a common cache, so that there is never more than one copy of a block in memory. During I/O, the cache block is flagged as being read or written, respectively.

The read flag is held only temporarily, during the I/O itself; it marks the block contents as being asynchronously changed (by the read call) and thus completely useless. Any thread wishing to access the block has to wait on it. After the read call returns, the thread that initiated the read re-acquires the monitor, clears the flag and wakes any waiting threads.

The write flag, on the other hand, acts as a WRITE LOCK on the block. Threads wishing read-only access to the block ignore it completely, since the block content is valid and will NOT change asynchronously: it is changed by the writing thread only while it holds the monitor, so no other thread will ever observe a change. A thread wishing to modify the block, however, has to wait, since a modification during an ongoing write could corrupt the data. Moreover, the lock need not be released immediately after the write call returns, but can be held until a second block is also written.
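A minimal sketch of this monitor using POSIX threads; the flag and function names are illustrative assumptions, corresponding to the LOCK/UNLOCK and WAIT/WAKE operations above and to the reading/writing sequences detailed below:
$
#include <pthread.h>

#define BLK_READING 1 /* contents invalid: async read in progress */
#define BLK_WRITING 2 /* WRITE LOCK: write in progress or pending */

typedef struct CacheBlock {
	int flags;
	int waiters;
	/* ... block number, mK bytes of data ... */
} CacheBlock;

static pthread_mutex_t monitor = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  wake    = PTHREAD_COND_INITIALIZER;

/* wait until the block contents are valid; called with the monitor
 * held. pthread_cond_wait releases the mutex while suspended and
 * re-acquires it before returning. Since one condition may serve
 * many blocks, the flag is re-checked after each wakeup.
 */
void block_await_valid( CacheBlock *b )
{
	while ( b->flags & BLK_READING ) {
		b->waiters++;
		pthread_cond_wait( &wake, &monitor );
		b->waiters--;
	}
}

/* the thread that started the read clears the flag and wakes any
 * waiters; it is thus always the first to get the block */
void block_end_read( CacheBlock *b )
{
	b->flags &= ~BLK_READING;
	if ( b->waiters )
		pthread_cond_broadcast( &wake );
}

/* a writer additionally waits for the WRITE LOCK; plain readers
 * ignore BLK_WRITING, since the block contents stay valid */
void block_lock_write( CacheBlock *b )
{
	while ( b->flags & (BLK_READING | BLK_WRITING) ) {
		b->waiters++;
		pthread_cond_wait( &wake, &monitor );
		b->waiters--;
	}
	b->flags |= BLK_WRITING;
}
$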
* The block reading sequence is as follows:
$
r0   find the block in the cache.
     if it is not found, set it reading, release the monitor, read.
     else, if it is being read, increase the waiter count, wait (releases the monitor).
     else use it (skip the next state).
[r1] SUSPENDED: wait for either our own read to return or to be waked by another reader.
     if we return from read, re-acquire the monitor.
     (if we are waked, the monitor is re-acquired implicitly.)
r2   if we return from read, clear the reading flag, wake waiting threads.
     if we are waked, decrease the waiter count.
     (since we might have fewer distinct conditions than blocks, we have to check
     the reading flag to see whether we were meant, and possibly wait again.)
     we have the block wanted. on searching, if it's not the leaf we wanted, start over.
$
NOTE:
- the thread starting the read on a block is always the first to get it, since waiters can proceed only after this thread has cleared the flag.
- for a block that already is in the cache, the reader proceeds without releasing the monitor. This especially holds for blocks being written. These properties can be used for a safe background cleanup.
- a reader does not need to hold a block in the cache while it is suspended in I/O: all processing happens immediately after reading. For a secondary reader on a database being written, at least leaf blocks should be invalidated asap (otherwise it will obviously miss some updates).

* The overall search algorithm:
$
s1   starting from the root, locate the first entry with key greater than our key,
     then read the block pointed to by the previous entry.
     if there is no such entry, and the block has a right sibling,
     check it for a possible split or shift (Lehman/Yao).
s2   when reaching a matching entry in a leaf, read subsequent right siblings
     according to the search range criteria and buffer.
$
NOTE:
- due to the right sibling checking, missing or too high keys for some children in inner nodes do not lead to inaccurate search results; they only carry a performance penalty (dropping the tree altogether means linear search).
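A sketch of this descent in C, reusing the illustrative Block declarations from the first sketch; blk_read(), blk_locate() and blk_child() are assumed helpers over the cache machinery (blk_read standing for the r0..r2 sequence):
$
/* Block and DictEntry as declared in the first sketch.
 * assumed helpers:
 */
Block   *blk_read( uint32_t blkno ); /* the r0..[r1]..r2 sequence */
int      blk_locate( const Block *b, const unsigned char *key,
	int klen, const unsigned char *ifp ); /* index of the first entry
	                            greater than (key,IFP), or b->used */
uint32_t blk_child( const Block *b, int i ); /* child block number */

/* locate the leaf where (key,IFP) would live, following right
 * siblings where a concurrent split or shift moved our range
 */
Block *btree_locate( const unsigned char *key, int klen,
	const unsigned char *ifp )
{
	Block *b = blk_read( 0 ); /* block 0 is the root */

	for (;;) {
		int i = blk_locate( b, key, klen, ifp );

		if ( i == b->used && b->right ) {
			/* no greater entry here: check the right sibling (s1) */
			b = blk_read( b->right );
			continue;
		}
		if ( b->leaf )
			return b; /* s2: the caller scans from here, following
			             right links for the search range criteria */
		/* descend via the block pointed to by the previous entry */
		b = blk_read( blk_child( b, i ? i - 1 : 0 ) );
	}
}
$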
* The general block writing sequence:
$
w0   read "for update" as in r0, but with an additional condition:
     if the WRITE LOCK flag is set, wait, else use it (skip the next state).
[w1] SUSPENDED: wait for the read, or to be waked by another writer.
     on return from read, re-acquire the monitor.
w2   if we are waked and there is a WRITE LOCK, wait again.
     (the original reader or some other waiting thread might have set it.)
     after returning from read, set the WRITE LOCK flag, wake waiting readers.
     if we need to write another block, start over.
     process the block(s). release the monitor, start the write call on the first block to write.
[w3] SUSPENDED: wait for the write call to return.
     repeat while there are more blocks to be written.
     re-acquire the monitor.
w4   clear the WRITE LOCK flag on some blocks, wake waiting threads.
     if there are more blocks to be processed, start over.
$

* The insert algorithm:
$
i0   locate the leaf L where to insert the entry.
     if it has enough room, process it and we're done.
i1   if it has a right sibling S, read it for update.
     if this hasn't enough room to take some entries from L,
     release the write lock and use a new block S as L's right sibling instead.
     shift entries to S, insert the new entry into L (or S).
i2   write first S, then L to disk. release the write lock on L.
i3   read the parent P of S for update, i.e. the parent of L or some block to the right.
     if we used an old S, delete its old entry.
     insert the new lower bound on the keys of S.
     if the parent hasn't enough room, start over at i1 (with P for L).
$
When unfolded into synchronized/unsynchronized steps, a simplified version of this might look like:
$
     read your way down to L until it's read for update ...
u0   not enough space in L, read S for update
[u1] wait for S                                       (L locked)
v2   modify both blocks, initiate read of P for update
[v3] wait for P, write S then L                       (L,S locked)
v4   release locks on L and S, update P or go to u1   (P locked)
$
NOTE:
- this is deadlock-free, since locking is ordered left to right, bottom to top.
- by always keeping one lock, we avoid conflicts with other writers for the same entries, since those have to follow the same lock path; thus we always work in a serializable manner.
- if we are the original reader for P, other threads wanting to read P have to wait for three I/Os. The detailed variant below avoids this.
- reading the sibling, and therefore the [u1] suspend, is optional. depending on the key distribution, it might not hurt much, or it might even be more efficient, to always create a new block.
- the only inconsistency a reader thread might notice is during v3 (or a later suspend in order to lock P's sibling): there are new versions of L and S, but P is in the old state. The key of S (or L's former sibling) is too high, and thus an entry now in S may be searched for in L. In this situation, the reader has to follow L's right link to find the key.

There is a slightly more complex variant of insert, which allows for somewhat enhanced concurrency, but at a higher synchronization effort:
$
u2   modify both blocks                               (L,S locked)
[u3] write S then L
u4   release lock on L (S locked), read P for update
[u5] wait for P
u6   release lock on S, update P or go to u1          (P locked)
$

* Multi-Process NOTEs:
- we have to assume that block writes
  * are atomic (seen by others completely or not at all) and
  * are ordered (if a later write is seen, an earlier one is also).
  This does NOT hold for the file state on disk, but should hold for the operating system's file cache.
- a reader in another process might also read the new version of S but the old version of L, and thus has to deal with duplicate entries when searching ranges.
- if a reader is caching, and thus the write order is not in effect, it might miss entries completely. This still gives correct results if
  * no leaves are cached and
  * for all cached blocks, the parent is older.
  This is ok for one-shot readers, or if caching is strictly from the top, with a forced re-read after a cache miss on the search path.

Deleting an entry is trivial, since it affects one block only (if we're lazy). In a single-process multi-threaded environment, however, we can take advantage of certain properties to provide relatively simple and safe cleanup, synchronously or in background.
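Finally, a sketch of the lazy delete, building on the structs from the first sketch and btree_locate() from the search sketch; blk_find() and blk_write() are again assumed helpers, and the sketch is simplified to one IFP per dictionary entry (with several IFPs per entry, the entry's IFP array would be shrunk instead):
$
#include <string.h>

/* assumed helpers: */
int  blk_find( const Block *b, const unsigned char *key,
	int klen, const unsigned char *ifp ); /* exact match or -1 */
void blk_write( Block *b ); /* w0..w4 sequence for a single block */

/* lazily delete one (key,IFP) entry: one block, no restructuring */
void btree_delete( const unsigned char *key, int klen,
	const unsigned char *ifp )
{
	/* the leaf, assumed to be read "for update" (WRITE LOCK held) */
	Block *b = btree_locate( key, klen, ifp );
	int i = blk_find( b, key, klen, ifp );

	if ( i >= 0 ) {
		/* close the gap in the sorted dictionary; the entry data
		 * itself stays as dead space at the block end until some
		 * (future background) cleanup compacts the block
		 */
		memmove( b->dict + i, b->dict + i + 1,
			( b->used - i - 1 ) * sizeof (DictEntry) );
		b->used--;
		blk_write( b ); /* write back, releasing the WRITE LOCK */
	}
}
$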