   1 =pod
   2
   3 =head1 NAME
   4
   5 CDS/ISIS manual appendix F, G and H
   6
   7 =head1 DESCRIPTION
   8
This is a partial scan of the CDS/ISIS manual (appendices F, G and H, pages
257-272), which was then converted to text using OCR and proofread.
However, there might be mistakes, and any corrections sent to
C<dpavlin@rot13.org> will be greatly appreciated.
  13
This digital version was made because the version currently available in
digital form doesn't contain details about the CDS/ISIS file format, which
were essential in making the L<Biblio::Isis> module.
  17
This extract of the manual has been produced in compliance with section (d)
of the WinIsis LICENCE for the receiving institution/person, which says:
  20
  21  The receiving institution/person may:
  22
  23  (d) Print/reproduce the CDS/ISIS manuals or portions thereof,
  24      provided that such copies reproduce the copyright notice;
  25
  26 =head1 CDS/ISIS Files
  27
  28 This section describes the various files of the CDS/ISIS system, the
  29 file naming conventions and the file extensions used for each type of
  30 file. All CDS/ISIS files have standard names as follows:
  31
  32   nnnnnn.eee
  33
  34 where:
  35
  36 =over 10
  37
  38 =item C<nnnnnn>
  39
  40 is the file name (all file names, except program names, are limited to
  41 a maximum of 6 characters)
  42
  43 =item C<.eee>
  44
  45 is the file extension identifying a particular type of file.
  46
  47 =back
  48
  49 Files marked with C<*> are ASCII files which you may display or print. The
  50 other files are binary files.
  51
  52 =head2 A. System files
  53
  54 System files are common to all CDS/ISIS users and include the various
  55 executable programs as well as system menus, worksheets and message
  56 files provided by Unesco as well as additional ones which you may
  57 create.
  58
  59 =head3 CDS/ISIS Program
  60
The name of the program file, as supplied by Unesco, is
  62
  63   ISIS.EXE
  64
  65 Depending on the release and/or target computer, there may also be one
  66 or more overlay files. These, if present, have the extension C<OVL>.
  67 Check the contents of your system diskettes or tape to see whether
  68 overlay files are present.
  69
  70 =head3 System menus and worksheets
  71
  72 All system menus and worksheets have the file extension FMT and the
  73 names are built as follows:
  74
  75   pctnnn.FMT
  76
  77 where:
  78
  79 =over 10
  80
  81 =item C<p>
  82
  83 is the page number (A for the first page, B for the second, etc.)
  84
  85 =item C<c>
  86
  87 is the language code (e.g. E for English), which must be one of those
  88 provided for in the language selection menu xXLNG.
  89
  90 =item C<t>
  91
  92 is X for menus and Y for system worksheets
  93
  94 =item C<nnn>
  95
  96 is a unique identifier
  97
  98 =back
  99
 100 For example the full name of the English version of the menu xXGEN is
 101 C<AEXGEN.FMT>.
 102
The page number is transparent to the CDS/ISIS user. Like the file
extension, the page number is automatically provided by the system.
Therefore, when a CDS/ISIS program prompts you to enter a menu or
worksheet name, you must not include the page number. Furthermore, as
file names are restricted to 6 characters, menu and worksheet names
may not be longer than 5 characters.
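
The naming rule above can be sketched in a few lines of Python. This is
only an illustration: the function name C<worksheet_file_name> and the
limit of 26 pages (letters A to Z) are assumptions made here, not part of
CDS/ISIS.

```python
def worksheet_file_name(name, page=1):
    """Build the .FMT file name for a menu or worksheet name such as 'EXGEN'."""
    if not 1 <= len(name) <= 5:
        raise ValueError("menu and worksheet names are limited to 5 characters")
    if not 1 <= page <= 26:
        raise ValueError("page letters are assumed to run from A to Z")
    page_letter = chr(ord("A") + page - 1)  # A for the first page, B for the second
    return page_letter + name.upper() + ".FMT"


print(worksheet_file_name("EXGEN"))  # AEXGEN.FMT
```

The user supplies only the 5-character name; the page letter and the
extension are added by the helper, mirroring what the system does.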
 109
 110 System menus and worksheets may only have one page.
 111
 112 The language code is mandatory for system menus and standard system
 113 worksheets. For example if you want to link a HELP menu to the system
 114 menu EXGEN, its name must begin with the letter E.
 115
 116 The B<X> convention is only enforced for standard system menus. It is a
 117 good practice, however, to use the same convention for menus that you
 118 create, and to avoid creating worksheets (including data entry
 119 worksheets) with X in this position, that is with names like xB<X>xxx.
 120
Furthermore, if a data base name contains B<X> or B<Y> in the second
position, then the corresponding data entry worksheets will be created
in the system worksheet directory (parameter 2 of C<SYSPAR.PAR>) rather
than the data base directory. Although this will not prevent normal
operation of the data base, it is not recommended.
 126
 127 =head3 System messages files
 128
 129 System messages and prompts are stored in standard CDS/ISIS data bases.
 130 All corresponding data base files (see below) are required when
 131 updating a message file, but only the Master file is used to display
 132 messages.
 133
 134 There must be a message data base for each language supported through
 135 the language selection menu xXLNG.
 136
 137 The data base name assigned to message data bases is xMSG (where x is
 138 the language code).
 139
 140 =head3 System tables
 141
 142 System tables are used by CDS/ISIS to define character sets. Two are
 143 required at present:
 144
 145 =over
 146
 147 =item C<ISISUC.TAB>*
 148
 149 defines lower to upper-case translation
 150
 151 =item C<ISISAC.TAB>*
 152
 153 defines the alphabetic characters.
 154
 155 =back
 156
 157 =head3 System print and work files
 158
Certain CDS/ISIS print functions do not send the output directly to the
printer but store it in a disk file from which you may then print it at
a convenient time. These files all have the file extension C<LST> and
are reused each time the corresponding function is executed.
 163
In addition, CDS/ISIS creates temporary work files which are normally
discarded automatically at the end of the session. If the session
terminates abnormally, however, they will not be deleted. A case of
abnormal termination would be a power failure while you are using a
CDS/ISIS program. These files too, however, are reused each time,
so you do not normally need to delete them manually. Work files
all have the extension C<TMP>.
 171
 172 The print and work files created by CDS/ISIS are given below:
 173
 174 =over
 175
 176 =item C<IFLIST.LST>*
 177
 178 Inverted file listing file (produced by ISISINV)
 179
 180 =item C<WSLIST.LST>*
 181
 182 Worksheet/menu listing file (produced by ISISUTL)
 183
 184 =item C<xMSG.LST>*
 185
 186 System messages listing file (produced by ISISUTL)
 187
 188 =item C<x.LST>*
 189
Printed output (produced by ISISPRT when no print file name is
supplied)
 192
=item C<SORT10.TMP>

Sort work file 1

=item C<SORT11.TMP>

Sort work file 2

=item C<SORT12.TMP>

Sort work file 3

=item C<SORT13.TMP>

Sort work file 4

=item C<SORT20.TMP>

Sort work file 5

=item C<SORT21.TMP>

Sort work file 6
 216
 217 =item C<SORT22.TMP>
 218
 219 Sort work file 7
 220
 221 =item C<SORT23.TMP>
 222
 223 Sort work file 8
 224
 225 =item C<TRACE.TMP>*
 226
 227 Trace file created by certain programs
 228
 229 =item C<ATSF.TMP>
 230
 231 Temporary storage for hit lists created during retrieval
 232
 233 =item C<ATSQ.TMP>
 234
 235 Temporary storage for search expressions
 236
 237 =back
 238
 239 =head2 B. Data Base files
 240
Each data base consists of a number of physically distinct files as
indicated below. There are three categories of data base files:

=over

=item 1

mandatory files, which must always be present.
These are normally established when the data base is defined by means of the
ISISDEF services and should never be deleted;

=item 2

auxiliary files created by the system whenever certain functions are
performed.
These can periodically be deleted when they are no longer needed.

=item 3

user files created by the data base user (such as display formats),
which are fully under the user's responsibility.

=back
 264
 265 In the following description C<xxxxxx> is the 1-6 character data base
 266 name.
 267
 268 =head3 Mandatory data base files
 269
 270 =over
 271
 272 =item C<xxxxxx.FDT>*
 273
 274 Field Definition Table
 275
 276 =item C<xxxxxx.FST>*
 277
 278 Field Select Table for Inverted file
 279
=item C<pxxxxx.FMT>*

Default data entry worksheet (where C<p> is the page number).
Note that the data base name is truncated to 5 characters if necessary.
 285
 286 =item C<xxxxxx.PFT>*
 287
 288 Default display format
 289
 290 =item C<xxxxxx.MST>
 291
 292 Master file
 293
 294 =item C<xxxxxx.XRF>
 295
 296 Crossreference file (Master file index)
 297
 298 =item C<xxxxxx.CNT>
 299
 300 B*tree (search term dictionary) control file
 301
 302 =item C<xxxxxx.N01>
 303
 304 B*tree Nodes (for terms up to 10 characters long)
 305
 306 =item C<xxxxxx.L01>
 307
 308 B*tree Leafs (for terms up to 10 characters long)
 309
 310 =item C<xxxxxx.N02>
 311
 312 B*tree Nodes (for terms longer than 10 characters)
 313
 314 =item C<xxxxxx.L02>
 315
 316 B*tree Leafs (for terms longer than 10 characters)
 317
 318 =item C<xxxxxx.IFP>
 319
 320 Inverted file postings
 321
 322 =item C<xxxxxx.ANY>*
 323
 324 ANY file
 325
 326 =back
 327
 328 =head3 Auxiliary files
 329
 330 =over
 331
=item C<xxxxxx.STW>*
 333
 334 Stopword file used during inverted file generation
 335
 336 =item C<xxxxxx.LN1>*
 337
 338 Unsorted Link file (short terms)
 339
 340 =item C<xxxxxx.LN2>*
 341
 342 Unsorted Link file (long terms)
 343
=item C<xxxxxx.LK1>*
 345
 346 Sorted Link file (short terms)
 347
 348 =item C<xxxxxx.LK2>*
 349
 350 Sorted Link file (long terms)
 351
 352 =item C<xxxxxx.BKP>
 353
 354 Master file backup
 355
 356 =item C<xxxxxx.XHF>
 357
 358 Hit file index
 359
 360 =item C<xxxxxx.HIT>
 361
 362 Hit file
 363
 364 =item C<xxxxxx.SRT>*
 365
Sort conversion table (see "Uppercase conversion table (ISISUC.TAB)" on
page 227)
 368
 369 =back
 370
 371 =head3 User files
 372
 373 =over
 374
 375 =item C<yyyyyy.FST>*
 376
 377 Field Select tables used for sorting
 378
 379 =item C<yyyyyy.PFT>*
 380
 381 Additional display formats
 382
 383 =item C<yyyyyy.FMT>*
 384
 385 Additional data entry worksheets
 386
 387 =item C<yyyyyy.STW>*
 388
 389 Additional stopword files
 390
 391 =item C<yyyyyy.SAV>
 392
 393 Save files created during retrieval
 394
 395 =back
 396
 397 The name of user files is fully under user control. However, in order
 398 to avoid possible name conflicts it is advisable to establish some
 399 standard conventions to be followed by all CDS/ISIS users at a given
 400 site, such as for example to define C<yyyyyy> as follows:
 401
 402   xxxyyy
 403
 404 where:
 405
 406 =over
 407
 408 =item C<xxx>
 409
is a data base identifier (which could be the first three letters of
the data base name if no two data base names are allowed to begin with
the same three letters)
 413
 414 =item C<yyy>
 415
is a user-chosen name.
 417
 418 =back
 419
 420 =head1 Master file structure and record format
 421
 422 =head2 A. Master file record format
 423
 424 The Master record is a variable length record consisting of three
 425 sections: a fixed length leader; a directory; and the variable length
 426 data fields.
 427
 428 =head3 Leader format
 429
 430 The leader consists of the following 7 integers (fields marked with *
 431 are 31-bit signed integers):
 432
 433 =over
 434
 435 =item C<MFN>*
 436
 437 Master file number
 438
 439 =item C<MFRL>
 440
 441 Record length (always an even number)
 442
 443 =item C<MFBWB>*
 444
 445 Backward pointer - Block number
 446
 447 =item C<MFBWP>
 448
 449 Backward pointer - Offset
 450
 451 =item C<BASE>
 452
 453 Offset to variable fields (this is the combined length of the Leader
 454 and Directory part of the record, in bytes)
 455
 456 =item C<NVF>
 457
 458 Number of fields in the record (i.e. number of directory entries)
 459
 460 =item C<STATUS>
 461
 462 Logical deletion indicator (0=record active; 1=record marked for
 463 deletion)
 464
 465 =back
 466
 467 C<MFBWB> and C<MFBWP> are initially set to 0 when the record is
 468 created. They are subsequently updated each time the record itself is
 469 updated (see below).
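
As an illustration, the leader could be decoded with Python's C<struct>
module. This is a minimal sketch under stated assumptions: fields marked
with * are taken as 4-byte integers and the others as 2-byte integers
(18 bytes in total, which agrees with the C<BASE> formula given for the
directory), and little-endian byte order without padding is assumed. The
names C<LEADER> and C<parse_leader> are made up for this example.

```python
import struct

# Assumed layout: little-endian, no padding; 4-byte integers for the fields
# marked with *, 2-byte integers for the rest (18 bytes in total).
LEADER = struct.Struct("<ihihhhh")
LEADER_FIELDS = ("MFN", "MFRL", "MFBWB", "MFBWP", "BASE", "NVF", "STATUS")


def parse_leader(data):
    """Return the leader fields found at the start of a Master record."""
    return dict(zip(LEADER_FIELDS, LEADER.unpack_from(data)))


# Hand-built leader: MFN 1, one directory entry, record active.
sample = LEADER.pack(1, 30, 0, 0, 24, 1, 0)
leader = parse_leader(sample)
print(leader["MFN"], leader["BASE"], leader["STATUS"])  # 1 24 0
```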
 470
 471 =head3 Directory format
 472
The directory is a table indicating the record contents. There is one
directory entry for each field present in the record (i.e. the
directory has exactly C<NVF> entries). Each directory entry consists of 3
integers:
 477
 478 =over
 479
 480 =item C<TAG>
 481
 482 Field Tag
 483
 484 =item C<POS>
 485
 486 Offset to first character position of field in the variable field
 487 section (the first field has C<POS=0>)
 488
 489 =item C<LEN>
 490
 491 Field length in bytes
 492
 493 =back
 494
 495 The total directory length in bytes is therefore C<6*NVF>; the C<BASE> field
 496 in the leader is always: C<18+6*NVF>.
 497
 498 =head3 Variable fields
 499
 500 This section contains the data fields (in the order indicated by the
 501 directory). Data fields are placed one after the other, with no
 502 separating characters.
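
The way the leader, directory and variable fields fit together can be
sketched as follows. The helper C<read_fields> is hypothetical; it assumes
little-endian 2-byte integers for C<TAG>, C<POS> and C<LEN>, and an 18-byte
leader in which C<BASE> and C<NVF> sit at byte offsets 12 and 14 (that is,
4-byte C<MFN> and C<MFBWB>, 2-byte integers elsewhere).

```python
import struct

LEADER_SIZE = 18                 # implied by BASE = 18 + 6*NVF
ENTRY = struct.Struct("<hhh")    # TAG, POS, LEN; 16-bit little-endian is an assumption


def read_fields(record):
    """Return (tag, value) pairs for every directory entry of one Master record."""
    base, nvf = struct.unpack_from("<hh", record, 12)  # BASE and NVF follow MFBWP
    fields = []
    for i in range(nvf):
        tag, pos, length = ENTRY.unpack_from(record, LEADER_SIZE + i * ENTRY.size)
        start = base + pos           # POS is relative to the variable field section
        fields.append((tag, record[start:start + length]))
    return fields


# Hand-built record with two fields: tag 10 = "Hello", tag 20 = "ISIS".
leader = struct.pack("<ihihhhh", 1, 40, 0, 0, 30, 2, 0)
directory = ENTRY.pack(10, 0, 5) + ENTRY.pack(20, 5, 4)
record = leader + directory + b"HelloISIS" + b" "   # padded to an even MFRL
print(read_fields(record))  # [(10, b'Hello'), (20, b'ISIS')]
```

Because there are no separators between fields, the directory is the only
way to find where one field ends and the next begins.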
 503
 504 =head2 B. Control record
 505
 506 The first record in the Master file is a control record which the
 507 system maintains automatically. This is never accessible to the ISIS
 508 user. Its contents are as follows (fields marked with C<*> are 31-bit
 509 signed integers):
 510
 511 =over
 512
 513 =item C<CTLMFN>*
 514
 515 always 0
 516
 517 =item C<NXTMFN>*
 518
 519 MFN to be assigned to the next record created in the data base
 520
 521 =item C<NXTMFB>*
 522
 523 Last block number allocated to the Master file (first block is 1)
 524
 525 =item C<NXTMFP>
 526
 527 Offset to next available position in last block
 528
 529 =item C<MFTYPE>
 530
 531 always 0 for user data base file (1 for system message files)
 532
=item C<RECCNT>*

=item C<MFCXX1>*

=item C<MFCXX2>*

=item C<MFCXX3>*

=back

(the last four fields are used for statistics during backup/restore).

=head2 C. Master file block format

The Master file records are stored consecutively, one after the other,
each record occupying exactly C<MFRL> bytes. The file is stored as
physical blocks of 512 bytes. A record may begin at any word boundary
between 0-498 (no record begins between 500-510) and may span over two
or more blocks.

As the Master file is created and/or updated, the system maintains an
index indicating the position of each record. The index is stored in
the Crossreference file (C<.XRF>).

=head2 D. Crossreference file

The C<XRF> file is organized as a table of pointers to the Master file.
The first pointer corresponds to MFN 1, the second to MFN 2, etc.

Each pointer consists of two fields:

=over

 566 =item C<XRFMFB>
 567
 568 (21 bits) Block number of Master file block containing the record
 569
 570 =item C<XRFMFP>
 571
 572 (11 bits) Offset in block of first character position of Master record
 573 (first block position is 0)
 574
 575 =back
 576
 577 which are stored in a 31-bit signed integer (4 bytes) as follows:
 578
 579   pointer = XRFMFB * 2048 + XRFMFP
 580
 581 (giving therefore a maximum Master file size of 500 Megabytes).
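
A small sketch of packing and unpacking such a pointer. For non-negative
values it follows the formula above directly. How a negative C<XRFMFB> is
combined with C<XRFMFP> is not spelled out here, so the sign handling
below (sign applied to the whole pointer) is an assumption, as are the
function names.

```python
def encode_xrf_pointer(xrfmfb, xrfmfp):
    """Combine block number and offset: pointer = XRFMFB * 2048 + XRFMFP."""
    sign = -1 if xrfmfb < 0 else 1           # assumption: sign covers the whole value
    return sign * (abs(xrfmfb) * 2048 + xrfmfp)


def decode_xrf_pointer(pointer):
    """Split a pointer back into (XRFMFB, XRFMFP)."""
    sign = -1 if pointer < 0 else 1
    block, offset = divmod(abs(pointer), 2048)
    return sign * block, offset


print(decode_xrf_pointer(3 * 2048 + 100))  # (3, 100)
```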
 582
 583 Each block of the C<XRF> file is 512 bytes and contains 127 pointers. The
 584 first field in each block (C<XRFPOS>) is a 31-bit signed integer whose
 585 absolute value is the C<XRF> block number. A negative C<XRFPOS> indicates
 586 the last block.
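
Given this block layout, the position of the pointer for a given C<MFN>
can be computed rather than searched for. The sketch below assumes that
C<XRF> blocks are stored back to back from the start of the file and that
C<XRFPOS> occupies the first 4 bytes of each block; C<xrf_pointer_offset>
is a made-up name.

```python
XRF_BLOCK_SIZE = 512
POINTERS_PER_BLOCK = 127


def xrf_pointer_offset(mfn):
    """Byte offset, within the XRF file, of the pointer for the given MFN."""
    if mfn < 1:
        raise ValueError("MFN numbering starts at 1")
    block_index, slot = divmod(mfn - 1, POINTERS_PER_BLOCK)
    # Skip the whole blocks before it, then the 4-byte XRFPOS, then earlier pointers.
    return block_index * XRF_BLOCK_SIZE + 4 + slot * 4


print(xrf_pointer_offset(1))    # 4
print(xrf_pointer_offset(128))  # 516
```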
 587
 588 I<Deleted> records are indicated as follows:
 589
 590 =over
 591
 592 =item C<XRFMFB E<lt> 0> and C<XRFMFP E<gt> 0>
 593
 594 logically deleted record (in this case C<ABS(XRFMFB)> is the correct block
 595 pointer and C<XRFMFP> is the offset of the record, which can therefore
 596 still be retrieved)
 597
 598 =item C<XRFMFB = -1> and C<XRFMFP = 0>
 599
 600 physically deleted record
 601
 602 =item C<XRFMFB = 0> and C<XRFMFP = 0>
 603
 604 inexistent record (all records beyond the highest C<MFN> assigned in the
 605 data base)
 606
 607 =back
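
These rules translate into a simple classification. The function below is
a sketch only; the name C<xrf_status> and the C<"unknown"> result for
combinations not listed above are assumptions.

```python
def xrf_status(xrfmfb, xrfmfp):
    """Classify an XRF entry from its block number and offset."""
    if xrfmfb == 0 and xrfmfp == 0:
        return "inexistent"
    if xrfmfb == -1 and xrfmfp == 0:
        return "physically deleted"
    if xrfmfb < 0 and xrfmfp > 0:
        return "logically deleted"   # still retrievable at block abs(xrfmfb)
    if xrfmfb > 0:
        return "active"
    return "unknown"                 # combination not covered by the rules above


print(xrf_status(-4, 120))  # logically deleted
```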
 608
 609 =head2 E. Master file updating technique
 610
 611 =head3 Creation of new records
 612
 613 New records are always added at the end of the Master file, at the
 614 position indicated by the fields C<NXTMFB>/C<NXTMFP> in the Master file
 615 control record. The C<MFN> to be assigned is also obtained from the field
 616 C<NXTMFN> in the control record.
 617
 618 After adding the record, C<NXTMFN> is increased by 1 and C<NXTMFB>/C<NXTMFP>
 619 are updated to point to the next available position. In addition a new
 620 pointer is created in the C<XRF> file and the C<XRFMFP> field corresponding
 621 to the record is increased by 1024 to indicate that this is a new
 622 record to be inverted (after the inversion of the record 1024 is
 623 subtracted from C<XRFMFP>).
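
Where a new record will land can be derived from the control record. The
sketch below reads only the first five control record fields and assumes
little-endian packing (4-byte integers for the fields marked with *,
2-byte integers otherwise), 512-byte blocks numbered from 1, and a 0-based
byte offset in C<NXTMFP>. C<next_record_slot> is a hypothetical helper.

```python
import struct

BLOCK_SIZE = 512
# First five control record fields; little-endian packing is an assumption.
CONTROL = struct.Struct("<iiihh")  # CTLMFN, NXTMFN, NXTMFB, NXTMFP, MFTYPE


def next_record_slot(control_record):
    """Return (MFN, byte offset) at which the next new record would be stored."""
    _ctlmfn, nxtmfn, nxtmfb, nxtmfp, _mftype = CONTROL.unpack_from(control_record)
    # Blocks are numbered from 1; NXTMFP is taken as a 0-based byte offset.
    return nxtmfn, (nxtmfb - 1) * BLOCK_SIZE + nxtmfp


sample = CONTROL.pack(0, 5, 2, 100, 0)
print(next_record_slot(sample))  # (5, 612)
```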
 624
 625 =head3 Update of existing records
 626
 627 Whenever you update a record (i.e., you call it in data entry and exit
 628 with option X from the editor) the system writes the record back to the
 629 Master file. Where it is written depends on the status of the record
 630 when it was initially read.
 631
 632 =head4 There was no inverted file update pending for the record
 633
 634 This condition is indicated by the following:
 635
 636 On C<XRF> C<XRFMFP E<lt> 512> and
 637
 638 On C<MST> C<MFBWB = 0> and C<MFBWP = 0>
 639
 640 In this case, the record is always rewritten at the end of the Master
 641 file (as if it were a new record) as indicated by C<NXTMFB>/C<NXTMFP> in the
 642 control record. In the new version of the record C<MFBWB>/C<MFBWP> are set to
 643 point to the old version of the record, while in the C<XRF> file the
 644 pointer points to the new version. In addition 512 is added to C<XRFMFP>
 645 to indicate that an inverted file update is pending. When the inverted
 646 file is updated, the old version of the record is used to determine the
 647 postings to be deleted and the new version is used to add the new
 648 postings. After the update of the Inverted file, 512 is subtracted from
 649 C<XRFMFP>, and C<MFBWB>/C<MFBWP> are reset to 0.
 650
 651 =head4 An inverted file update was pending
 652
 653 This condition is indicated by the following:
 654
 655 On C<XRF> C<XRFMFP E<gt> 512> and
 656
 657 On C<MST> C<MFBWB E<gt> 0>
 658
 659 In this case C<MFBWB>/C<MFBWP> point to the version of the record which is
 660 currently reflected in the Inverted file. If possible, i.e. if the
 661 record length was not increased, the record is written back at its
 662 original location, otherwise it is written at the end of the file. In
 663 both cases, C<MFBWB>/C<MFBWP> are not changed.
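
Since a record never starts beyond offset 510, the values 512 and 1024
added to C<XRFMFP> behave like marker bits on top of the real offset. The
sketch below separates them; treating them as independent bits is an
interpretation, and C<split_xrfmfp> is a made-up name.

```python
NEW_RECORD = 1024   # added to XRFMFP when a record is created
UPDATED = 512       # added to XRFMFP when an existing record is rewritten


def split_xrfmfp(xrfmfp):
    """Separate the block offset from the pending-inversion markers."""
    return {
        "offset": xrfmfp & 511,
        "new": bool(xrfmfp & NEW_RECORD),
        "updated": bool(xrfmfp & UPDATED),
    }


print(split_xrfmfp(100 + 512))  # {'offset': 100, 'new': False, 'updated': True}
```

An entry with neither marker set has no inverted file update pending.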
 664
 665 =head3 Deletion of records
 666
 667 Record deletion is treated as an update, with the following additional
 668 markings:
 669
 670 On C<XRF> C<XRFMFB> is negative
 671
 672 On C<MST> C<STATUS> is set to 1
 673
 674 =head2 F. Master file reorganization
 675
 676 As indicated above, as Master file records are updated the C<MST> file
 677 grows in size and there will be lost space in the file which cannot be
 678 used. The reorganization facilities allow this space to be reclaimed by
 679 recompacting the file.
 680
 681 During the backup phase a Master file backup file is created (C<.BKP>).
 682 The structure and format of this file is the same as the Master file
 683 (C<.MST>), except that a Crossreference file is not required as all the
 684 records are adjacent. Records marked for deletion are not backed up.
 685 Because only the latest copy of each record is backed up, the system
 686 does not allow you to perform a backup whenever an Inverted file update
 687 is pending for one or more records.
 688
During the restore phase the backup file is read sequentially and the
program recreates the C<MST> and C<XRF> files. At this point all records which
were marked for logical deletion (before the backup) are now marked as
physically deleted (by setting C<XRFMFB = -1> and C<XRFMFP = 0>).
Deleted records are detected by checking holes in the C<MFN> numbering.
 694
 695 =head1 Inverted file structure and record formats
 696
 697 =head2 A. Introduction
 698
 699 The CDS/ISIS Inverted file consists of six physical files, five of
 700 which contain the dictionary of searchable terms (organized as a
 701 B*tree) and the sixth contains the list of postings associated with
 702 each term. In order to optimize disk storage, two separate B*trees are
 703 maintained, one for terms of up to 10 characters (stored in files
 704 C<.N01>/C<.L01>) and one for terms longer than 10 characters, up to a maximum
 705 of 30 characters (stored in files C<.N02>/C<.L02>). The file C<CNT> contains
 706 control fields for both B*trees. In each B*tree the file C<.N0x> contains
 707 the nodes of the tree and the C<.L0x> file contains the leafs. The leaf
 708 records point to the postings file C<.IFP>.
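
Which pair of files holds a given term therefore depends only on its
length. A minimal sketch (the function name is hypothetical, and rejecting
terms outside 1 to 30 characters is an assumption about how a caller would
want to handle them):

```python
def btree_for_term(term):
    """Return the node and leaf file extensions that would hold this term."""
    length = len(term)
    if not 1 <= length <= 30:
        raise ValueError("terms are assumed to be 1 to 30 characters long")
    tree = 1 if length <= 10 else 2
    return {"nodes": ".N0%d" % tree, "leafs": ".L0%d" % tree}


print(btree_for_term("UNESCO"))  # {'nodes': '.N01', 'leafs': '.L01'}
```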
 709
 710 The relationship between the various files is schematically represented
 711 in Figure 67.
 712
 713 The physical relationship between these six files is a
 714 pointer, which represents the relative address of the record being
 715 pointed to. A relative address is the ordinal record number of a record
 716 in a given file (i.e. the first record is record number 1, the second
 717 is record number 2, etc.). The file C<.CNT> points to the file C<.N0x>,
 718 C<.N0x> points to C<.L0x>, and C<.L0x> points to C<.IFP>. Because the
 719 C<.IFP> is a packed file, the pointer from C<.L0x> to C<.IFP> has two
 720 components: the block number and the offset within the block, each expressed
 721 as an integer.
 722
 723 =head2 B. Format of C<.CNT> file
 724
This file contains two 26-byte fixed length records (one for each
B*tree) each containing 10 integers as follows (fields marked with *
are 31-bit signed integers):
 728
 729 =over
 730
 731 =item C<IDTYPE>
 732
 733 B*tree type (1 for C<.N01>/C<.L01>, 2 for C<.N02>/C<.L02>)
 734
 735 =item C<ORDN>
 736
 737 Nodes order (each C<.N0x> record contains at most C<2*ORDN> keys)
 738
 739 =item C<ORDF>
 740
 741 Leafs order (each C<.L0x> record contains at most C<2*ORDF> keys)
 742
 743 =item C<N>
 744
 745 Number of memory buffers allocated for nodes
 746
 747 =item C<K>
 748
Number of buffers allocated to 1st level index (C<K E<lt> N>)
 750
 751 =item C<LIV>
 752
 753 Current number of index levels
 754
 755 =item C<POSRX>*
 756
 757 Pointer to Root record in C<.N0x>
 758
 759 =item C<NMAXPOS>*
 760
 761 Next available position in C<.N0x> file
 762
 763 =item C<FMAXPOS>*
 764
 765 Next available position in C<.L0x> file
 766
 767 =item C<ABNORMAL>
 768
 769 Formal B*tree normality indicator (0 if B*tree is abnormal, 1 if B*tree
 770 is normal). A B*tree is abnormal if the nodes file C<.N0x> contains only
 771 the Root.
 772
 773 =back
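
The 26-byte record size is consistent with seven 2-byte integers and three
4-byte integers. Under that reading, and assuming little-endian packing
with no padding, both records could be decoded as sketched below
(C<CNT_RECORD> and C<parse_cnt> are names invented for this example).

```python
import struct

# Six 2-byte fields, three 4-byte fields (marked with *), one 2-byte field.
CNT_RECORD = struct.Struct("<hhhhhhiiih")
CNT_FIELDS = ("IDTYPE", "ORDN", "ORDF", "N", "K", "LIV",
              "POSRX", "NMAXPOS", "FMAXPOS", "ABNORMAL")


def parse_cnt(data):
    """Decode the two control records (one per B*tree) of a .CNT file."""
    return [
        dict(zip(CNT_FIELDS, CNT_RECORD.unpack_from(data, i * CNT_RECORD.size)))
        for i in range(2)
    ]


sample = (CNT_RECORD.pack(1, 5, 5, 15, 5, 1, 1, 2, 3, 0)
          + CNT_RECORD.pack(2, 5, 5, 15, 5, 2, 1, 9, 40, 1))
trees = parse_cnt(sample)
print(trees[1]["IDTYPE"], trees[1]["FMAXPOS"])  # 2 40
```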
 774
 775 C<ORDN>, C<ORDF>, C<N> and C<K> are fixed for a given generated system.
 776 Currently these values are set as follows:
 777
 778 C<ORDN = 5>; C<ORDF = 5>; C<N = 15>; C<K = 5> for both B*trees
 779
The other values are set as required when the B*trees are generated.

                  +--------------+
                  | Root address |
                  +-------|------+
                          |                          .CNT file
                          |                      -------------
                          |                          .N0x file
              +-----------V--------+
              | Key1 Key2 ... Keyn |                   Root
              +---|-------------|--+
                  |             |
            +-----+             +------+
            |                          |
 +----------V----------+     +---------V----------+ 1st level
 | Key1  Key2 ... Keyn | ... | Key1 Key2 ... Keyn |   index
 +--|------------------+     +-----------------|--+
    |                                          :
    :                                  +-------+
    |                                  |
 +--V------------------+     +---------V----------+ last level
 | Key1  Key2 ... Keyn | ... | Key1 Key2 ... Keyn |   index
 +---------|-----------+     +---------|----------+
           |                           |
           |                           |         -------------
           |                           |             .L0x file
 +---------V-----------+     +---------V----------+
 | Key1  Key2 ... Keyn | ... | Key1 Key2 ... Keyn |
 +--|------------------+     +--------------------+
    |
    |                                            -------------
    |                                                .IFP file
 +--V----------------------------------+
 | P1  P2  P3 ..................... Pn |
 +-------------------------------------+

I<Figure 67: Inverted file structure>
 817
 818 =head2 C. Format of C<.N0x> files
 819
These files contain the indexes of the dictionary of searchable terms
(C<.N01> for terms shorter than 11 characters and C<.N02> for terms longer
than 10 characters). The C<.N0x> file records have the following format
(fields marked with * are 31-bit signed integers):
 824
 825 =over
 826
 827 =item C<POS>*
 828
 829 an integer indicating the relative record number (1 for the first
 830 record, 2 for the second record, etc.)
 831
 832 =item C<OCK>
 833
 834 an integer indicating the number of active keys in the record
 835 ( C<1 E<lt>= OCK E<lt>= 2*ORDN> )
 836
 837 =item C<IT>
 838
 839 an integer indicating the type of B*tree (1 for C<.N01>, 2 for C<.N02>)
 840
 841 =item C<IDX>
 842
an array of C<2*ORDN> entries (C<OCK> of which are active), each having the
following format:
 845
 846 =over 4
 847
 848 =item C<KEY>
 849
a fixed length character string of length C<LEx> (C<LE1 = 10>, C<LE2 = 30>)
 851
 852 =item C<PUNT>
 853
a pointer to the C<.N0x> record (if C<PUNT E<gt> 0>) or C<.L0x> record
(if C<PUNT E<lt> 0>) whose C<IDX(1).KEY = KEY>. C<PUNT = 0> indicates
an inactive entry. A positive C<PUNT> indicates a branch to a hierarchically
lower level index. The lowest level index (C<PUNT E<lt> 0>) points to the leafs in
the C<.L0x> file.
 859
 860 =back
 861
 862 =back
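
The three cases for C<PUNT> can be expressed as a small dispatch function.
This is a sketch: C<follow_punt> is a hypothetical name, and using the
absolute value of a negative C<PUNT> as the relative record number in the
C<.L0x> file is an assumption.

```python
def follow_punt(punt):
    """Tell which file and record an index entry points to."""
    if punt == 0:
        return None                   # inactive entry
    if punt > 0:
        return (".N0x", punt)         # branch to a lower level index record
    return (".L0x", -punt)            # lowest index level: points to a leaf record


print(follow_punt(-7))  # ('.L0x', 7)
```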
 863
 864 =head2 D. Format of C<.L0x> files
 865
 866 These files contain the full dictionary of searchable terms (C<.L01> for
 867 terms shorter than 11 characters and C<.L02> for terms longer than 10
 868 characters). The C<.L0x> file records have the following format (fields
 869 marked with C<*> are 31-bit signed integers):
 870
 871 =over
 872
 873 =item C<POS>*
 874
 875 an integer indicating the relative record number (1 for the first
 876 record, 2 for the second record, etc.)
 877
 878 =item C<OCK>
 879
an integer indicating the number of active keys in the record
(C<1 E<lt>= OCK E<lt>= 2*ORDF>)
 882
 883 =item C<IT>
 884
an integer indicating the type of B*tree (1 for C<.L01>, 2 for C<.L02>)
 886
 887 =item C<PS>*
 888
 889 is the immediate successor of C<IDX[OCK].KEY> in this record (this is used
 890 to speed up sequential access to the file)
 891
 892 =item C<IDX>
 893
an array of C<2*ORDF> entries (C<OCK> of which are active), each having the
following format:
 896
 897 =over 4
 898
 899 =item C<KEY>
 900
 901 a fixed length character string of length C<LEx> (C<LE1=10>, C<LE2=30>)
 902
 903 =item C<INFO>
 904
 905 a pointer to the C<.IFP> record where the list of postings associated with
 906 C<KEY> begins. This pointer consists of two 31-bit signed integers as
 907 follows:
 908
 909 =over 8
 910
 911 =item C<INFO[1]>*
 912
 913 relative block number in C<.IFP>
 914
 915 =item C<INFO[2]>*
 916
 917 offset (word number relative to 0) to postings list
 918
 919 =back
 920
 921 =back
 922
 923 =back
 924
 925 =head2 E. Format of C<.IFP> file
 926
 927 This file contains the list of postings for each dictionary term. Each
 928 list of postings has the format indicated below. The file is structured
 929 in blocks of 512 characters, where (for an initially loaded and
 930 compacted file) the lists of postings for each term are adjacent,
 931 except as noted below.
 932
 933 The general format of each block is:
 934
 935 =over
 936
 937 =item C<IFPBLK>
 938
 939 a 31-bit signed integer indicating the Block number of this block
 940 (blocks are numbered from 1)
 941
 942 =item C<IFPREC>
 943
 944 An array of 127 31-bit signed integers
 945
 946 =back
 947
C<IFPREC[1]> and C<IFPREC[2]> of the first block are a pointer to the
next available position in the C<.IFP> file.
 950
 951 Pointers from C<.L0x> to C<.IFP> and pointers within C<.IFP> consist of two
 952 31-bit signed integers: the first integer is a block number, and the
 953 second integer is a word offset in C<IFPREC> (e.g. the offset to the
 954 first word in C<IFPREC> is 0). The list of postings associated with the
 955 first search term will therefore start at 1/0.
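
A block/word pointer can be turned into a byte position in the C<.IFP>
file as sketched below. The sketch assumes that blocks are stored back to
back, that C<IFPBLK> takes the first 4 bytes of each block, and that one
word is one 4-byte element of C<IFPREC>; C<ifp_byte_offset> is a made-up
name.

```python
IFP_BLOCK_SIZE = 512
WORD_SIZE = 4   # one IFPREC element (assumed to be what "word" refers to)


def ifp_byte_offset(block, word):
    """Byte offset in the .IFP file for a block number / word offset pointer."""
    if block < 1 or not 0 <= word < 127:
        raise ValueError("block numbers start at 1 and IFPREC has 127 words")
    # Skip earlier blocks, then this block's IFPBLK, then the preceding words.
    return (block - 1) * IFP_BLOCK_SIZE + 4 + word * WORD_SIZE


print(ifp_byte_offset(1, 0))  # 4
print(ifp_byte_offset(3, 4))  # 1044
```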
 956
 957 Each list of postings consists of a header (5 double-words) followed by
 958 the actual list of postings (8 bytes for each posting). The header has
 959 the following format (each field is a 31-bit signed integer):
 960
 961 =over
 962
 963 =item C<IFPNXTB>*
 964
 965 Pointer to next segment (Block number)
 966
 967 =item C<IFPNXTP>*
 968
 969 Pointer to next segment (offset)
 970
 971 =item C<IFPTOTP>*
 972
 973 Total number of postings (accurate only in first segment)
 974
 975 =item C<IFPSEGP>*
 976
 977 Number of postings in this segment (C<IFPSEGP E<lt>= IFPTOTP>)
 978
 979 =item C<IFPSEGC>*
 980
 981 Segment capacity (i.e. number of postings which can be stored in this
 982 segment)
 983
 984 =back
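
Read as five consecutive 4-byte integers, the header could be decoded as
follows. Little-endian byte order is an assumption, as are the names
C<parse_postings_header> and C<is_last_segment>; the latter relies on a
next-segment pointer of 0/0 meaning that no further segment follows.

```python
import struct

HEADER = struct.Struct("<5i")   # five 4-byte integers, 20 bytes in total
HEADER_FIELDS = ("IFPNXTB", "IFPNXTP", "IFPTOTP", "IFPSEGP", "IFPSEGC")


def parse_postings_header(data, offset=0):
    """Decode the header of one postings segment."""
    return dict(zip(HEADER_FIELDS, HEADER.unpack_from(data, offset)))


def is_last_segment(header):
    """A next-segment pointer of 0/0 is taken to mean 'no further segment'."""
    return header["IFPNXTB"] == 0 and header["IFPNXTP"] == 0


header = parse_postings_header(HEADER.pack(0, 0, 5, 5, 5))
print(header["IFPTOTP"], is_last_segment(header))  # 5 True
```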
 985
 986 Each posting is a 64-bit string partitioned as follows:
 987
 988 =over
 989
 990 =item C<PMFN>
 991
 992 (24 bits) Master file number
 993
 994 =item C<PTAG>
 995
 996 (16 bits) Field identifier (assigned from the C<FST>)
 997
 998 =item C<POCC>
 999
1000 (8 bits) Occurrence number
1001
1002 =item C<PCNT>
1003
1004 (16 bits) Term sequence number in field
1005
1006 =back
1007
1008 Each field is stored in a strict left-to-right sequence with leading
1009 zeros added if necessary to adjust the corresponding bit string to the
1010 right (this allows comparisons of two postings as character strings).
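
The bit layout can be illustrated by packing a posting into 8 bytes, most
significant field first, and unpacking it again. The function names are
made up for this sketch; the field widths are the ones listed above.

```python
def pack_posting(pmfn, ptag, pocc, pcnt):
    """Pack PMFN (24 bits), PTAG (16), POCC (8) and PCNT (16) into 8 bytes."""
    value = (pmfn << 40) | (ptag << 24) | (pocc << 16) | pcnt
    return value.to_bytes(8, "big")


def unpack_posting(raw):
    """Split an 8-byte posting back into its four fields."""
    if len(raw) != 8:
        raise ValueError("a posting is exactly 8 bytes long")
    value = int.from_bytes(raw, "big")
    return {
        "PMFN": value >> 40,
        "PTAG": (value >> 24) & 0xFFFF,
        "POCC": (value >> 16) & 0xFF,
        "PCNT": value & 0xFFFF,
    }


print(unpack_posting(pack_posting(1, 70, 2, 3)))
# {'PMFN': 1, 'PTAG': 70, 'POCC': 2, 'PCNT': 3}
```

With this layout, comparing two packed postings byte by byte gives the
same order as comparing them field by field.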
1011
The list of postings is stored in ascending C<PMFN>/C<PTAG>/C<POCC>/C<PCNT>
sequence. When the inverted file is loaded sequentially (e.g. after a
full inverted file generation with ISISINV), each list consists of one
or more adjacent segments. If C<IFPTOTP E<lt>= 32768> then:
C<IFPNXTB/IFPNXTP = 0/0> and C<IFPTOTP = IFPSEGP = IFPSEGC>.
1017
As updates are performed, additional segments may be created whenever
new postings must be added. In this case a new segment with capacity
C<IFPTOTP> is created and linked to other segments (through the pointer
C<IFPNXTB>/C<IFPNXTP>) in such a way that the sequence
C<PMFN>/C<PTAG>/C<POCC>/C<PCNT> is maintained. Whenever such a split occurs
the postings of the segment where the new posting should have been inserted
are equally distributed between this segment and the newly created segment.
New segments are always written at the end of the file (which is maintained
in C<IFPREC[1]>/C<IFPREC[2]> of the first C<.IFP> block).
1027
1028 For example, assume that a new posting C<Px> has to be inserted between C<P2>
1029 and C<P3> in the following list:
1030
1031  +----------------------------+
1032  | 0 0 5 5 5 | P1 P2 P3 P4 P5 |
1033  +----------------------------+
1034
1035 after the split (and assuming that the next available position in C<.IFP>
1036 is 3/4) the list of postings will consist of the following two segments:
1037
 +----------------------------+
 | 3 4 5 3 5 | P1 P2 Px -- -- |
 +--|-------------------------+
    |
 +--V-------------------------+
 | 0 0 5 3 5 | P3 P4 P5 -- -- |
 +----------------------------+
1045
In this situation, no new segment will be created until either segment
becomes full again.
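
The split can be simulated on plain Python lists. This is a simplified
model rather than the actual CDS/ISIS routine: postings are represented by
sortable values, the header bookkeeping is left out, and the exact
tie-breaking for an odd number of postings is an assumption.

```python
def insert_posting(segment, posting, capacity):
    """Insert a posting into a sorted segment, splitting it when it is full."""
    merged = sorted(segment + [posting])
    if len(segment) < capacity:
        return [merged]                       # room left: no split needed
    half = (len(merged) + 1) // 2             # distribute the postings equally
    return [merged[:half], merged[half:]]


# Five postings in a full segment of capacity 5; 30 belongs between 20 and 40.
print(insert_posting([10, 20, 40, 50, 60], 30, capacity=5))
# [[10, 20, 30], [40, 50, 60]]
```

After the split both segments hold three postings and have room for two
more, which is why no further segment is needed until one of them fills up.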
1048
1049 As mentioned above, the posting lists are normally stored one after the
1050 other. However, in order to facilitate access to the C<.IFP> file the
1051 segments are stored in such a way that:
1052
1053 =over
1054
1055 =item 1
1056
1057 the header and the first posting in each list (28 bytes) are never
1058 split between two blocks.
1059
1060 =item 2
1061
1062 a posting is never split between two blocks; if there is not enough
1063 room in the current block the whole posting is stored in the next
1064 block.
1065
1066 =back
1067
1068 =head1 LICENCE
1069
1070 UNESCO has developed and owns the intellectual property of the CDS/ISIS
1071 software (in whole or in part, including all files and documentation, from
1072 here on referred to as CDS/ISIS) for the storage and retrieval of
1073 information.
1074
1075 For complete text of licence visit
1076 L<http://www.unesco.org/isis/files/winisislicense.html>.
1077
1078 =cut
1079