+++ /dev/null
-This document discusses various issues of indexing
-(creating entries for the inverted file).
-
-
-* the inverted file
-
-The inverted file maps keys to (sorted) sequences of values.
-Although an application can store arbitrary information in values,
-in standard use, they denote a hit (where the key was found in a record).
-A hit is a structure of five numbers, specifying (in that order)
-a database number, a rowid (mfn), tag, occurence (of this tag in it's record)
-and position (of this "word" in it's field).
-
-The inverted file is implemented as a B-Link-Tree (Lehmann/Yao)
-with variable length keys of up to 255 bytes.
-The length of values is fixed for a given index, but can be configured
-on index creation in the range 8 to 23 bytes (8+n, with 4 bits for n).
-The 8 byte minimum configuration (corresponding to the traditional IFP) uses
-- 2 bytes pos
-- 1 byte occ
-- 2 bytes tag
-- 3 bytes mfn
-A ninth byte is used for the mfn, bytes 10 and 11 for a database number,
-additional bytes have no standard usage in hits (by now).
-
-Occurence and position are typically used for "near" conditions in queries.
-Where this is not needed, they could also be used, for example,
-to apply weights to the hits.
-
-
-* creation of index entries
-
-Traditional ISIS systems based indexing on
-- a "print format" to create text from a record
-- an "indexing technique" to extract entries.
- The various indexing techniques split the text at newlines, subfields,
- word boundaries, angle brackets or other delimiters.
-
-OpenIsis uses a different approach based on an "index record",
-which is created either explicitly by application code or using
-> Views a view.
-Most split up like extracting keywords from between angle brackets
-or other delimiters is done during creation of this record,
-containing, in the simplest case, one field per index entry.
-
-The only additional transformation that may be applied
-is word split to be done during
-> Collating
-, because the definition of which characters constitute words is
-configured together with the collation info.
-
-
-In the index record, fields with a non-negative tag are treated as
-index entries according to the current mode.
-All fields with negative tags give processing instructions as follows
-(denoted with symbolic tag names):
-- index f[ields] [startocc]
- an xmode field with value 'f' (or some prefix of 'fields') sets default
- indexing mode, where each value (of a non-negative field) is used as
- literal index entry. The occurence is set to startocc or 0,
- incremented for runs of fields with the same tag,
- and reset to 0 on tag change.
-- index w[ords] [startpos]
- sets word mode, assuming fields to contain pre-split words.
- Instead of the occ, the pos is set, incremented and reset.
-- index s[plit]
- sets word autosplit mode. Index entries are split into words according
- to collation info; each word increments the pos, each field resets the
- pos and increments or resets the occ as in fields mode.
- If the index has no collation info, all characters but the well-known
- ASCII non-letters are assumed to be word characters.
-- index a[dd]|d[el]
- sets add (default) or del mode for index entries.
-- index m[fn] mfn
- sets the mfn to use
-- index n
- where n is a number, sets max key length to n (default is 30)
-- hit [+|-][[dbn.]mfn.]tag.[occ].[pos]|tag[.pos] key
- declares explicit hit data for key.
- With a leading + or -, explicitly add or del the key regardless of mode.
- Unspecified fields are determined as usual from the environment.
- Since search results can be sorted by increasing pos,
- this field can also be used to specify a weight
- (starting from 0 for a 100% match).
-- fst viewspec
- declares a view line to create fields from the current data record, see
-> Views
-
-The index fields can also be used with a tag and value in their field value,
-in which case the given mode applies to that tag and value only.
-(TODO)
-
-While view-based indexing is yet to be done (as of May 03),
-application controlled indexing is working and robust
-(see the save method in openIsis.tcl and sample.fsp).
-
-
-* special values
-
-The entries created according to the index record are then subject to a
-> Collating collation transformation
-to create index keys.
-
-Finally they are handed to the B-Tree for insertion.
-If the key is already there, it may have an associated special value,
-which does not denote a hit, but rather processing instructions.
-Special values are recognized by a mfn of 0.
-
-- a completely blank value of all zeroes marks a stopword.
- no other values for this key are added to the index.
- (This is implemented).
-- a value of mfn 0, first byte of tag 1, marks a ref.
- The rest of the value and some successive values contain an alternate key,
- under which the entry is made (and searched).
- Since this has some difficulties, it will probably take some time
- until this is implemented...
-
-A multilingual (English, French, Italian, Spanish and Portuguese)
-> http://www.bib.wau.nl/isis/docum.txt stopword file
-can be loaded like
-$
-tr '[a-z]' '[A-Z]' <docum.txt | sort | sto/openisis -db db/test/test -swload
-$
-
----
- $Id: Inverting.txt,v 1.3 2003/05/08 13:35:20 kripke Exp $