CDS/ISIS manual appendix F, G and H

This is a partial scan of the CDS/ISIS manual (appendices F, G and H, pages
257-272), which was then converted to text using OCR and proofread.
However, there might be mistakes, and any corrections sent to
C<dpavlin@rot13.org> will be greatly appreciated.

This digital version was made because the version currently available in
digital form doesn't contain details about the CDS/ISIS file format, and it
was essential in making the L<Biblio::Isis> module.

This extract of the manual has been produced in compliance with section (d)
of the WinIsis LICENCE for the receiving institution/person, which says:

The receiving institution/person may:

(d) Print/reproduce the CDS/ISIS manuals or portions thereof,
provided that such copies reproduce the copyright notice;

This section describes the various files of the CDS/ISIS system, the
file naming conventions and the file extensions used for each type of
file. All CDS/ISIS files have standard names as follows:

is the file name (all file names, except program names, are limited to
a maximum of 6 characters)

is the file extension identifying a particular type of file.

Files marked with C<*> are ASCII files which you may display or print. The
other files are binary files.

=head2 A. System files

System files are common to all CDS/ISIS users and include the various
executable programs as well as system menus, worksheets and message
files provided by Unesco as well as additional ones which you may

=head3 CDS/ISIS Program

The name of the program file, as supplied by Unesco is

Depending on the release and/or target computer, there may also be one
or more overlay files. These, if present, have the extension C<OVL>.
Check the contents of your system diskettes or tape to see whether
overlay files are present.

=head3 System menus and worksheets

All system menus and worksheets have the file extension FMT and the
names are built as follows:

is the page number (A for the first page, B for the second, etc.)

is the language code (e.g. E for English), which must be one of those
provided for in the language selection menu xXLNG.

is X for menus and Y for system worksheets

is a unique identifier

For example the full name of the English version of the menu xXGEN is

The page number is transparent to the CDS/ISIS user. Like the file
extension, the page number is automatically provided by the system.
Therefore, when a CDS/ISIS program prompts you to enter a menu or
worksheet name, you must not include the page number. Furthermore, as
file names are restricted to 6 characters, menu and worksheet names
may not be longer than 5 characters.

System menus and worksheets may only have one page.

The language code is mandatory for system menus and standard system
worksheets. For example if you want to link a HELP menu to the system
menu EXGEN, its name must begin with the letter E.

The B<X> convention is only enforced for standard system menus. It is
good practice, however, to use the same convention for menus that you
create, and to avoid creating worksheets (including data entry
worksheets) with X in this position, that is with names like xB<X>xxx.

Furthermore, if a data base name contains B<X> or B<Y> in the second
position, then the corresponding data entry worksheets will be created
in the system worksheet directory (parameter 2 of C<SYSPAR.PAR>) rather
than the data base directory. Although this will not prevent normal
operation of the data base, it is not recommended.

=head3 System messages files

System messages and prompts are stored in standard CDS/ISIS data bases.
All corresponding data base files (see below) are required when
updating a message file, but only the Master file is used to display

There must be a message data base for each language supported through
the language selection menu xXLNG.

The data base name assigned to message data bases is xMSG (where x is

System tables are used by CDS/ISIS to define character sets. Two are

defines lower to upper-case translation

defines the alphabetic characters.

=head3 System print and work files

Certain CDS/ISIS print functions do not send the output directly to the
printer but store it on a disk file from which you may then print it at
a convenient time. These files all have the file extension C<LST> and
are reused each time the corresponding function is executed.

In addition CDS/ISIS creates temporary work files which are normally
automatically discarded at the end of the session. If the session
terminates abnormally, however, they will not be deleted. A case of
abnormal termination would be a power failure while you are using a
CDS/ISIS program. These files too, however, are reused each time,
so that you do not normally need to delete them manually. Work files
all have the extension C<TMP>.

The print and work files created by CDS/ISIS are given below:

Inverted file listing file (produced by ISISINV)

Worksheet/menu listing file (produced by ISISUTL)

System messages listing file (produced by ISISUTL)

Printed output (produced by ISISPRT when printing no print file name is

Trace file created by certain programs

Temporary storage for hit lists created during retrieval

Temporary storage for search expressions

=head2 B. Data Base files

Each data base consists of a number of physically distinct files as
indicated below. There are three categories of data base files:

mandatory files, which must always be present.
These are normally established when the data base is defined by means of the
ISISDEF services and should never be deleted;

auxiliary files created by the system whenever certain functions are

These can periodically be deleted when they are no longer needed.

user files created by the data base user (such as display formats),
which are fully under the user's responsibility.

In the following description C<xxxxxx> is the 1-6 character data base

=head3 Mandatory data base files

Field Definition Table

Field Select Table for Inverted file

Default data entry worksheet (where p is the page number).

Note that the data base name is truncated to 5 characters if necessary

Default display format

Crossreference file (Master file index)

B*tree (search term dictionary) control file

B*tree Nodes (for terms up to 10 characters long)

B*tree Leafs (for terms up to 10 characters long)

B*tree Nodes (for terms longer than 10 characters)

B*tree Leafs (for terms longer than 10 characters)

Inverted file postings

=head3 Auxiliary files

Stopword file used during inverted file generation

Unsorted Link file (short terms)

Unsorted Link file (long terms)

Sorted Link file (short terms)

Sorted Link file (long terms)

Sort conversion table (see "Uppercase conversion table (ISISUC.TAB)" on

Field Select tables used for sorting

Additional display formats

Additional data entry worksheets

Additional stopword files

Save files created during retrieval

The name of user files is fully under user control. However, in order
to avoid possible name conflicts it is advisable to establish some
standard conventions to be followed by all CDS/ISIS users at a given
site, for example by defining C<yyyyyy> as follows:

is a data base identifier (which could be the first three letters of
the data base name if no two data base names are allowed to begin with
the same three letters)

=head1 Master file structure and record format

=head2 A. Master file record format

The Master record is a variable length record consisting of three
sections: a fixed length leader; a directory; and the variable length

The leader consists of the following 7 integers (fields marked with *
are 31-bit signed integers):

Record length (always an even number)

Backward pointer - Block number

Backward pointer - Offset

Offset to variable fields (this is the combined length of the Leader
and Directory part of the record, in bytes)

Number of fields in the record (i.e. number of directory entries)

Logical deletion indicator (0=record active; 1=record marked for

C<MFBWB> and C<MFBWP> are initially set to 0 when the record is
created. They are subsequently updated each time the record itself is

=head3 Directory format

The directory is a table indicating the record contents. There is one
directory entry for each field present in the record (i.e. the
directory has exactly NVF entries). Each directory entry consists of 3

Offset to first character position of field in the variable field
section (the first field has C<POS=0>)

Field length in bytes

The total directory length in bytes is therefore C<6*NVF>; the C<BASE> field
in the leader is always: C<18+6*NVF>.

=head3 Variable fields

This section contains the data fields (in the order indicated by the
directory). Data fields are placed one after the other, with no
separating characters.

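The leader, directory and variable field layout described above can be
sketched as a small parser. This is only an illustration under stated
assumptions: little-endian byte order, 16-bit integers for the unmarked
leader fields and 32-bit integers for two of them (taken here to be the MFN
and the backward block pointer, which is what makes the leader 18 bytes
long), and the names C<TAG> and C<LEN> for the first and last part of a
directory entry. None of these details is confirmed by the text above.

```python
import struct

# Assumed layout: MFN (32-bit), MFRL, MFBWB (32-bit), MFBWP, BASE, NVF, STATUS.
# "<" means little-endian with no padding, which gives the 18 bytes of the text.
LEADER = struct.Struct("<ihihhhh")
# Assumed directory entry: TAG, POS, LEN as three 16-bit integers (6 bytes).
DIR_ENTRY = struct.Struct("<hhh")


def parse_master_record(buf):
    """Split one Master file record into its leader values and fields."""
    mfn, mfrl, mfbwb, mfbwp, base, nvf, status = LEADER.unpack_from(buf, 0)
    # The manual states that BASE is always 18 + 6*NVF.
    if base != LEADER.size + DIR_ENTRY.size * nvf:
        raise ValueError("BASE does not match 18 + 6*NVF")
    fields = []
    for i in range(nvf):
        entry_at = LEADER.size + i * DIR_ENTRY.size
        tag, pos, length = DIR_ENTRY.unpack_from(buf, entry_at)
        start = base + pos  # POS is relative to the variable field section
        fields.append((tag, bytes(buf[start:start + length])))
    return {
        "MFN": mfn, "MFRL": mfrl, "MFBWB": mfbwb, "MFBWP": mfbwp,
        "BASE": base, "NVF": nvf, "STATUS": status, "fields": fields,
    }
```

Field data is returned as raw bytes, because the text above does not say
which character set a given data base uses.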
=head2 B. Control record

The first record in the Master file is a control record which the
system maintains automatically. This is never accessible to the ISIS
user. Its contents are as follows (fields marked with C<*> are 31-bit

MFN to be assigned to the next record created in the data base

Last block number allocated to the Master file (first block is 1)

Offset to next available position in last block

always 0 for user data base file (1 for system message files)

(the last four fields are used for statistics during backup/restore).

=head2 C. Master file block format

The Master file records are stored consecutively, one after the other,
each record occupying exactly C<MFRL> bytes. The file is stored as
physical blocks of 512 bytes. A record may begin at any word boundary
between 0-498 (no record begins between 500-510) and may span over two

As the Master file is created and/or updated, the system maintains an
index indicating the position of each record. The index is stored in
the Crossreference file (C<.XRF>).

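Since blocks are 512 bytes long, block numbers start at 1 and offsets within
a block start at 0, a block number and offset pair can be turned into a byte
position in the C<.MST> file. The helper below is a minimal sketch of that
arithmetic; the function name is made up for this example.

```python
BLOCK_SIZE = 512  # physical block size given in the text


def mst_byte_offset(block, offset):
    """Byte position in the .MST file for a 1-based block and 0-based offset."""
    if block < 1:
        raise ValueError("block numbers start at 1")
    if not 0 <= offset < BLOCK_SIZE:
        raise ValueError("offset must lie inside one block")
    return (block - 1) * BLOCK_SIZE + offset
```

For example, a record at block 3, offset 64 starts 1088 bytes into the file.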
=head2 D. Crossreference file

The C<XRF> file is organized as a table of pointers to the Master file.
The first pointer corresponds to MFN 1, the second to MFN 2, etc.

Each pointer consists of two fields:

(21 bits) Block number of Master file block containing the record

(11 bits) Offset in block of first character position of Master record
(first block position is 0)

which are stored in a 31-bit signed integer (4 bytes) as follows:

  pointer = XRFMFB * 2048 + XRFMFP

(giving therefore a maximum Master file size of 500 Megabytes).

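The packing formula can be inverted with a floor division by 2048. The
sketch below follows the formula literally, so a negative C<XRFMFB> combined
with a non-negative C<XRFMFP> round-trips correctly; whether real files
store negative pointers exactly this way is an assumption, and the function
names are invented for this example.

```python
def encode_xrf_pointer(xrfmfb, xrfmfp):
    """Pack block number and offset as pointer = XRFMFB * 2048 + XRFMFP."""
    if not 0 <= xrfmfp < 2048:
        raise ValueError("XRFMFP must fit in 11 bits")
    return xrfmfb * 2048 + xrfmfp


def decode_xrf_pointer(pointer):
    """Return (XRFMFB, XRFMFP); floor division keeps XRFMFP in 0..2047."""
    return divmod(pointer, 2048)
```

With this convention, C<decode_xrf_pointer(-2048)> gives C<(-1, 0)>.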
Each block of the C<XRF> file is 512 bytes and contains 127 pointers. The
first field in each block (C<XRFPOS>) is a 31-bit signed integer whose
absolute value is the C<XRF> block number. A negative C<XRFPOS> indicates

I<Deleted> records are indicated as follows:

=over 4

=item C<XRFMFB E<lt> 0> and C<XRFMFP E<gt> 0>

logically deleted record (in this case C<ABS(XRFMFB)> is the correct block
pointer and C<XRFMFP> is the offset of the record, which can therefore

=item C<XRFMFB = -1> and C<XRFMFP = 0>

physically deleted record

=item C<XRFMFB = 0> and C<XRFMFP = 0>

inexistent record (all records beyond the highest C<MFN> assigned in the

=back

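Putting the block layout and the deletion rules together, a lookup by MFN
might look like the sketch below. It assumes little-endian 32-bit integers
and that the pointer is unpacked with a floor division by 2048 (the literal
inverse of the packing formula); the function names and status strings are
invented for this example.

```python
import struct

XRF_BLOCK_SIZE = 512
POINTERS_PER_BLOCK = 127  # each block: XRFPOS followed by 127 pointers


def xrf_pointer_offset(mfn):
    """Byte position in the .XRF file of the pointer for a 1-based MFN."""
    block_index, slot = divmod(mfn - 1, POINTERS_PER_BLOCK)
    return block_index * XRF_BLOCK_SIZE + 4 + slot * 4  # skip XRFPOS


def read_xrf_entry(xrf_bytes, mfn):
    """Return (XRFMFB, XRFMFP) for the given MFN."""
    (pointer,) = struct.unpack_from("<i", xrf_bytes, xrf_pointer_offset(mfn))
    return divmod(pointer, 2048)


def classify(xrfmfb, xrfmfp):
    """Apply the deletion rules listed in the text."""
    if xrfmfb == 0 and xrfmfp == 0:
        return "inexistent"
    if xrfmfb == -1 and xrfmfp == 0:
        return "physically deleted"
    if xrfmfb < 0 and xrfmfp > 0:
        return "logically deleted"
    if xrfmfb > 0:
        return "active"
    return "unknown"
```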
=head2 E. Master file updating technique

=head3 Creation of new records

New records are always added at the end of the Master file, at the
position indicated by the fields C<NXTMFB>/C<NXTMFP> in the Master file
control record. The C<MFN> to be assigned is also obtained from the field
C<NXTMFN> in the control record.

After adding the record, C<NXTMFN> is increased by 1 and C<NXTMFB>/C<NXTMFP>
are updated to point to the next available position. In addition a new
pointer is created in the C<XRF> file and the C<XRFMFP> field corresponding
to the record is increased by 1024 to indicate that this is a new
record to be inverted (after the inversion of the record 1024 is
subtracted from C<XRFMFP>).

=head3 Update of existing records

Whenever you update a record (i.e., you call it in data entry and exit
with option X from the editor) the system writes the record back to the
Master file. Where it is written depends on the status of the record
when it was initially read.

=head4 There was no inverted file update pending for the record

This condition is indicated by the following:

On C<XRF> C<XRFMFP E<lt> 512> and

On C<MST> C<MFBWB = 0> and C<MFBWP = 0>

In this case, the record is always rewritten at the end of the Master
file (as if it were a new record) as indicated by C<NXTMFB>/C<NXTMFP> in the
control record. In the new version of the record C<MFBWB>/C<MFBWP> are set to
point to the old version of the record, while in the C<XRF> file the
pointer points to the new version. In addition 512 is added to C<XRFMFP>
to indicate that an inverted file update is pending. When the inverted
file is updated, the old version of the record is used to determine the
postings to be deleted and the new version is used to add the new
postings. After the update of the Inverted file, 512 is subtracted from
C<XRFMFP>, and C<MFBWB>/C<MFBWP> are reset to 0.

=head4 An inverted file update was pending

This condition is indicated by the following:

On C<XRF> C<XRFMFP E<gt> 512> and

On C<MST> C<MFBWB E<gt> 0>

In this case C<MFBWB>/C<MFBWP> point to the version of the record which is
currently reflected in the Inverted file. If possible, i.e. if the
record length was not increased, the record is written back at its
original location, otherwise it is written at the end of the file. In
both cases, C<MFBWB>/C<MFBWP> are not changed.

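Because a real offset inside a 512-byte block is always below 512, the values
1024 (new record awaiting inversion) and 512 (inverted file update pending)
added to C<XRFMFP> behave like flag bits. The following sketch separates them
from the actual offset; treating them as independent bits is an
interpretation of the text, and the names are invented for this example.

```python
NEW_RECORD = 1024      # added to XRFMFP for a new record to be inverted
UPDATE_PENDING = 512   # added to XRFMFP while an inverted file update is pending


def split_xrfmfp(xrfmfp):
    """Return (offset_in_block, is_new_record, update_pending)."""
    offset = xrfmfp & (UPDATE_PENDING - 1)  # low 9 bits: 0..511
    return offset, bool(xrfmfp & NEW_RECORD), bool(xrfmfp & UPDATE_PENDING)
```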
=head3 Deletion of records

Record deletion is treated as an update, with the following additional

On C<XRF> C<XRFMFB> is negative

On C<MST> C<STATUS> is set to 1

=head2 F. Master file reorganization

As indicated above, as Master file records are updated the C<MST> file
grows in size and there will be lost space in the file which cannot be
used. The reorganization facilities allow this space to be reclaimed by
recompacting the file.

During the backup phase a Master file backup file is created (C<.BKP>).
The structure and format of this file is the same as the Master file
(C<.MST>), except that a Crossreference file is not required as all the
records are adjacent. Records marked for deletion are not backed up.
Because only the latest copy of each record is backed up, the system
does not allow you to perform a backup whenever an Inverted file update
is pending for one or more records.

During the restore phase the backup file is read sequentially and the
program recreates the C<MST> and C<XRF> files. At this point all records which
were marked for logical deletion (before the backup) are now marked as
physically deleted (by setting C<XRFMFB = -1> and C<XRFMFP = 0>).
Deleted records are detected by checking holes in the C<MFN> numbering.

=head1 Inverted file structure and record formats

=head2 A. Introduction

The CDS/ISIS Inverted file consists of six physical files, five of
which contain the dictionary of searchable terms (organized as a
B*tree) and the sixth contains the list of postings associated with
each term. In order to optimize disk storage, two separate B*trees are
maintained, one for terms of up to 10 characters (stored in files
C<.N01>/C<.L01>) and one for terms longer than 10 characters, up to a maximum
of 30 characters (stored in files C<.N02>/C<.L02>). The file C<.CNT> contains
control fields for both B*trees. In each B*tree the file C<.N0x> contains
the nodes of the tree and the C<.L0x> file contains the leafs. The leaf
records point to the postings file C<.IFP>.

The relationship between the various files is schematically represented

The physical relationship between these six files is a
pointer, which represents the relative address of the record being
pointed to. A relative address is the ordinal record number of a record
in a given file (i.e. the first record is record number 1, the second
is record number 2, etc.). The file C<.CNT> points to the file C<.N0x>,
C<.N0x> points to C<.L0x>, and C<.L0x> points to C<.IFP>. Because the
C<.IFP> is a packed file, the pointer from C<.L0x> to C<.IFP> has two
components: the block number and the offset within the block, each expressed

=head2 B. Format of C<.CNT> file

This file contains two 26-byte fixed length records (one for each
B*tree) each containing 10 integers as follows (fields marked with *
are 31-bit signed integers):

B*tree type (1 for C<.N01>/C<.L01>, 2 for C<.N02>/C<.L02>)

Nodes order (each C<.N0x> record contains at most C<2*ORDN> keys)

Leafs order (each C<.L0x> record contains at most C<2*ORDF> keys)

Number of memory buffers allocated for nodes

Number of buffers allocated to 1st level index (C<K E<lt> N>)

Current number of index levels

Pointer to Root record in C<.N0x>

Next available position in C<.N0x> file

Next available position in C<.L0x> file

Formal B*tree normality indicator (0 if B*tree is abnormal, 1 if B*tree
is normal). A B*tree is abnormal if the nodes file C<.N0x> contains only

C<ORDN>, C<ORDF>, C<N> and C<K> are fixed for a given generated system.
Currently these values are set as follows:

C<ORDN = 5>; C<ORDF = 5>; C<N = 15>; C<K = 5> for both B*trees

  +-----------V--------+
  | Key1 Key2 ... Keyn |                              Root
  +---|-------------|--+

  +----------V----------+     +---------V----------+  1st level
  | Key1 Key2 ... Keyn  | ... | Key1 Key2 ... Keyn |  index
  +--|------------------+     +-----------------|--+

  +--V------------------+     +---------V----------+  last level
  | Key1 Key2 ... Keyn  | ... | Key1 Key2 ... Keyn |  index
  +---------|-----------+     +---------|----------+

  +---------V-----------+     +---------V----------+
  | Key1 Key2 ... Keyn  | ... | Key1 Key2 ... Keyn |
  +--|------------------+     +--------------------+

  +--V----------------------------------+
  | P1 P2 P3 ..................... Pn   |
  +-------------------------------------+

I<Figure 67: Inverted file structure>

The other values are set as required when the B*trees are generated.

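As a rough illustration of the C<.CNT> layout, the two 26-byte control
records could be read as shown below. Ten integers fit in 26 bytes if seven
are 16-bit and three are 32-bit; this sketch assumes the three wide ones are
the root pointer and the two next-available positions, and that the file is
little-endian. The lowercase key names are invented for this example and are
not taken from the manual.

```python
import struct

# Assumed layout: six 16-bit values, three 32-bit values, one 16-bit value.
CNT_RECORD = struct.Struct("<6h3ih")  # 12 + 12 + 2 = 26 bytes

CNT_KEYS = (
    "btree_type", "ordn", "ordf", "n", "k", "levels",
    "root", "next_node", "next_leaf", "normality",
)


def parse_cnt(data):
    """Return one dict per B*tree control record found in a .CNT file."""
    if len(data) < 2 * CNT_RECORD.size:
        raise ValueError("a .CNT file holds two 26-byte records")
    records = []
    for index in range(2):
        values = CNT_RECORD.unpack_from(data, index * CNT_RECORD.size)
        records.append(dict(zip(CNT_KEYS, values)))
    return records
```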
=head2 C. Format of C<.N0x> files

These files contain the index of the dictionary of searchable terms
(C<.N01> for terms shorter than 11 characters and C<.N02> for terms longer
than 10 characters). The C<.N0x> file records have the following format
(fields marked with * are 31-bit signed integers):

an integer indicating the relative record number (1 for the first
record, 2 for the second record, etc.)

an integer indicating the number of active keys in the record
(C<1 E<lt>= OCK E<lt>= 2*ORDN>)

an integer indicating the type of B*tree (1 for C<.N01>, 2 for C<.N02>)

an array of C<ORDN> entries (C<OCK> of which are active), each having the

a fixed length character string of length C<LEx> (C<LE1 = 10>, C<LE2 = 30>)

a pointer to the C<.N0x> record (if C<PUNT E<gt> 0>) or C<.L0x> record
(if C<PUNT E<lt> 0>) whose C<IDX(1).KEY = KEY>. C<PUNT = 0> indicates
an inactive entry. A positive C<PUNT> indicates a branch to a hierarchically
lower level index. The lowest level index (C<PUNT E<lt> 0>) points to the leafs in

=head2 D. Format of C<.L0x> files

These files contain the full dictionary of searchable terms (C<.L01> for
terms shorter than 11 characters and C<.L02> for terms longer than 10
characters). The C<.L0x> file records have the following format (fields
marked with C<*> are 31-bit signed integers):

an integer indicating the relative record number (1 for the first
record, 2 for the second record, etc.)

an integer indicating the number of active keys in the record
(C<1 E<lt>= OCK E<lt>= 2*ORDF>)

an integer indicating the type of B*tree (1 for C<.N01>, 2 for C<.N02>)

is the immediate successor of C<IDX[OCK].KEY> in this record (this is used
to speed up sequential access to the file)

an array of C<ORDN> entries (C<OCK> of which are active), each having the

a fixed length character string of length C<LEx> (C<LE1=10>, C<LE2=30>)

a pointer to the C<.IFP> record where the list of postings associated with
C<KEY> begins. This pointer consists of two 31-bit signed integers as

relative block number in C<.IFP>

offset (word number relative to 0) to postings list

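A leaf record as described above could be decoded roughly as follows. This
is a sketch under several assumptions that the text does not confirm:
little-endian byte order, a 12-byte header (32-bit record number, 16-bit key
count, 16-bit B*tree type, 32-bit successor pointer), no padding between
entries, and keys padded with spaces in a single-byte character set. The
function name is invented for this example.

```python
import struct

# Assumed header: record number, number of active keys, B*tree type,
# pointer to the successor leaf record.
LEAF_HEADER = struct.Struct("<ihhi")


def parse_leaf(buf, key_len):
    """Return (record_number, successor, [(term, ifp_block, ifp_offset), ...]).

    key_len is 10 for .L01 and 30 for .L02.
    """
    pos, ock, btree_type, successor = LEAF_HEADER.unpack_from(buf, 0)
    # Assumed entry: fixed length key followed by two 32-bit integers.
    entry = struct.Struct("<%dsii" % key_len)
    terms = []
    for i in range(ock):  # only the first OCK entries are active
        at = LEAF_HEADER.size + i * entry.size
        key, block, offset = entry.unpack_from(buf, at)
        terms.append((key.decode("latin-1").rstrip(), block, offset))
    return pos, successor, terms
```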
=head2 E. Format of C<.IFP> file

This file contains the list of postings for each dictionary term. Each
list of postings has the format indicated below. The file is structured
in blocks of 512 characters, where (for an initially loaded and
compacted file) the lists of postings for each term are adjacent,
except as noted below.

The general format of each block is:

a 31-bit signed integer indicating the Block number of this block
(blocks are numbered from 1)

An array of 127 31-bit signed integers

C<IFPREC[1]> and C<IFPREC[2]> of the first block are a pointer to the
next available position in the C<.IFP> file.

Pointers from C<.L0x> to C<.IFP> and pointers within C<.IFP> consist of two
31-bit signed integers: the first integer is a block number, and the
second integer is a word offset in C<IFPREC> (e.g. the offset to the
first word in C<IFPREC> is 0). The list of postings associated with the
first search term will therefore start at 1/0.

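A block number and word offset pair can be converted to a byte position in
the C<.IFP> file by skipping the earlier 512-byte blocks and the 4-byte
block number at the start of the target block. The sketch below assumes that
each word of C<IFPREC> occupies 4 bytes; the function name is invented for
this example.

```python
IFP_BLOCK_SIZE = 512
WORD_SIZE = 4           # assumed size of one IFPREC word
WORDS_PER_BLOCK = 127   # IFPREC holds 127 integers per block


def ifp_byte_offset(block, word):
    """Byte position in .IFP for a 1-based block and 0-based word offset."""
    if block < 1:
        raise ValueError("blocks are numbered from 1")
    if not 0 <= word < WORDS_PER_BLOCK:
        raise ValueError("word offset must lie inside IFPREC")
    # Skip earlier blocks, then the 4-byte block number of this block.
    return (block - 1) * IFP_BLOCK_SIZE + WORD_SIZE + word * WORD_SIZE
```

Under these assumptions the pointer 1/0 maps to byte 4 and 3/4 to byte 1044.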
Each list of postings consists of a header (5 double-words) followed by
the actual list of postings (8 bytes for each posting). The header has
the following format (each field is a 31-bit signed integer):

Pointer to next segment (Block number)

Pointer to next segment (offset)

Total number of postings (accurate only in first segment)

Number of postings in this segment (C<IFPSEGP E<lt>= IFPTOTP>)

Segment capacity (i.e. number of postings which can be stored in this

Each posting is a 64-bit string partitioned as follows:

(24 bits) Master file number

(16 bits) Field identifier (assigned from the C<FST>)

(8 bits) Occurrence number

(16 bits) Term sequence number in field

Each field is stored in a strict left-to-right sequence with leading
zeros added if necessary to adjust the corresponding bit string to the
right (this allows comparisons of two postings as character strings).

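The 24/16/8/16-bit partition can be illustrated by packing the four values
into one 64-bit integer and writing it most significant byte first, which is
what makes byte-wise comparison agree with numeric order. The names C<pmfn>,
C<ptag>, C<pocc> and C<pcnt> stand for the four parts in the order listed;
the big-endian byte order is an interpretation of "strict left-to-right",
not something the text states outright.

```python
def pack_posting(pmfn, ptag, pocc, pcnt):
    """Pack MFN (24 bits), tag (16), occurrence (8), sequence (16) into 8 bytes."""
    if not (0 <= pmfn < 1 << 24 and 0 <= ptag < 1 << 16
            and 0 <= pocc < 1 << 8 and 0 <= pcnt < 1 << 16):
        raise ValueError("value does not fit its bit field")
    value = (pmfn << 40) | (ptag << 24) | (pocc << 16) | pcnt
    return value.to_bytes(8, "big")


def unpack_posting(raw):
    """Inverse of pack_posting: return (pmfn, ptag, pocc, pcnt)."""
    value = int.from_bytes(raw, "big")
    return (value >> 40, (value >> 24) & 0xFFFF,
            (value >> 16) & 0xFF, value & 0xFFFF)
```

Because the most significant part comes first, sorting the packed byte
strings sorts the postings by MFN, then tag, then occurrence, then sequence.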
The list of postings is stored in ascending C<PMFN>/C<PTAG>/C<POCC>/C<PCNT>
sequence. When the inverted file is loaded sequentially (e.g. after a
full inverted file generation with ISISINV), each list consists of one
or more adjacent segments. If C<IFPTOTP E<lt>= 32768> then:
C<IFPNXTB/IFPNXTP = 0/0> and C<IFPTOTP = IFPSEGP = IFPSEGC>.

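The segment header and the condition for a freshly loaded, single-segment
list can be expressed as a short check. The sketch assumes five
little-endian 32-bit integers in the order next block, next offset, total
postings, postings in segment, segment capacity; the function names are
invented for this example.

```python
import struct

SEGMENT_HEADER = struct.Struct("<5i")  # 5 double-words = 20 bytes


def parse_segment_header(buf, offset=0):
    """Read one posting list segment header starting at a byte offset."""
    nxtb, nxtp, totp, segp, segc = SEGMENT_HEADER.unpack_from(buf, offset)
    return {"IFPNXTB": nxtb, "IFPNXTP": nxtp,
            "IFPTOTP": totp, "IFPSEGP": segp, "IFPSEGC": segc}


def is_single_loaded_segment(header):
    """True when there is no next segment and the segment is exactly full."""
    return ((header["IFPNXTB"], header["IFPNXTP"]) == (0, 0)
            and header["IFPTOTP"] == header["IFPSEGP"] == header["IFPSEGC"])
```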
As updates are performed, additional segments may be created whenever
new postings must be added. In this case a new segment with capacity
C<IFPTOTP> is created and linked to other segments (through the pointer
C<IFPNXTB>/C<IFPNXTP>) in such a way that the sequence
C<PMFN>/C<PTAG>/C<POCC>/C<PCNT> is maintained. Whenever such a split occurs
the postings of the segment where the new posting should have been inserted
are equally distributed between this segment and the newly created segment.
New segments are always written at the end of the file (which is maintained
in C<IFPREC[1]>/C<IFPREC[2]> of the first C<.IFP> block).

For example, assume that a new posting C<Px> has to be inserted between C<P2>
and C<P3> in the following list:

  +----------------------------+
  | 0 0 5 5 5 | P1 P2 P3 P4 P5 |
  +----------------------------+

after the split (and assuming that the next available position in C<.IFP>
is 3/4) the list of postings will consist of the following two segments:

  +----------------------------+
  | 3 4 5 3 5 | P1 P2 Px -- -- |
  +--|-------------------------+

  +--V-------------------------+
  | 0 0 5 3 5 | P3 P4 P5 -- -- |
  +----------------------------+

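The split shown in this example can be simulated in memory. The sketch below
models a segment as a dictionary holding its next pointer, its capacity and a
sorted list of postings; it leaves out the bookkeeping of the total count
field and any file I/O, and all names are invented for this example.

```python
import bisect


def insert_posting(segment, posting, next_free, new_capacity):
    """Insert a posting; split the segment evenly when it is already full.

    Returns the newly created segment, or None if no split was needed.
    """
    if len(segment["postings"]) < segment["capacity"]:
        bisect.insort(segment["postings"], posting)
        return None
    merged = sorted(segment["postings"] + [posting])
    half = (len(merged) + 1) // 2
    new_segment = {
        "next": segment["next"],      # new segment takes over the old link
        "capacity": new_capacity,
        "postings": merged[half:],
    }
    segment["postings"] = merged[:half]
    segment["next"] = next_free       # old segment now points to the new one
    return new_segment
```

With postings 10, 20, 30, 40, 50 standing in for P1 to P5 and 25 for Px, the
first segment ends up with 10, 20, 25 and a next pointer of 3/4, and the new
segment holds 30, 40, 50.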
In this situation, no new segment will be created until either segment

As mentioned above, the posting lists are normally stored one after the
other. However, in order to facilitate access to the C<.IFP> file the
segments are stored in such a way that:

the header and the first posting in each list (28 bytes) are never
split between two blocks.

a posting is never split between two blocks; if there is not enough
room in the current block the whole posting is stored in the next

UNESCO has developed and owns the intellectual property of the CDS/ISIS
software (in whole or in part, including all files and documentation, from
here on referred to as CDS/ISIS) for the storage and retrieval of

For the complete text of the licence visit
L<http://www.unesco.org/isis/files/winisislicense.html>.