   1 structuring ISIS records using subfields or subrecords
   2
   3
   4 *       structures
   5
   6 The means by which an Isis record can be structured into "data elements"
   7 ("A defined unit of information", Z39.2 a.k.a. ISO2709)
   8 fall into one of two broad categories (citing Z39.2):
   9 -       subfields
  10         "A data element considered as a component of a field."
  11         In *ML (SGML,HTML,XML...), subfields correspond to a node's attributes.
  12         In MIME, subfields correspond to attributes of a MIME header value.
  13 -       subrecords
  14         "A group of fields within a record that may be treated as a logical entity.
  15         (When a record describes more than one entity,
  16         the descriptions of individual entities may be treated as subrecords.)"
  17         In *ML, subrecords correspond to a node's children.
  18         In MIME, subrecords correspond to multipart body parts.
  19
  20
  21 *       subfields
  22
  23 Since a field value can actually be anything,
  24 including XML text or a serialized (textual or binary) Isis record,
  25 it can be arbitrarily structured according to a regular expression
  26 or some other grammar (machine parseable or not).
  27
  28
  29 The term subfield, however, is used for a range of characters in the value
  30 which is identified by rather simple means:
  31 -       fixed
  32         if all (or all but the last) subfields have a fixed length
  33         and are neither optional nor repeatable,
  34         then each subfield can be found at a fixed position.
  35 -       delimited with optional identifier
  36         this is the proper Z39.2 notion of a subfield.
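For the fixed case, finding a subfield is plain slicing. The following Python sketch illustrates this; the function name and the list of widths are assumptions made for illustration and are not part of OpenIsis.

```python
def split_fixed(value, widths):
    """Cut `value` into fixed-length subfields.

    `widths` lists the lengths of all but the last subfield;
    the last subfield takes whatever remains.
    """
    parts, pos = [], 0
    for width in widths:
        parts.append(value[pos:pos + width])
        pos += width
    parts.append(value[pos:])  # remainder: the only variable-length part
    return parts
```

With widths `[8]`, a value such as `20030623eng` would be cut into an eight character date and a trailing language code.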
  37
  38 If a special delimiter character is found in the field,
  39 it breaks the field into subfields.
  40 Z39.2, and thus MARC, use the character 31 as delimiter
  41 (hex 1F, CTRL-_, ASCII "unit separator" US).
  42 Traditional Isis uses the caret '^'.
  43
  44 OpenIsis permits any character, including the horizontal TAB and semicolon.
  45 More precisely, OpenIsis inverts Z39.2's notion that
  46 "every subfield is INTRODUCED by a delimiter, unless it isn't"
  47 into the principle that for every data element, it is specified
  48 how its end is detected, including by fixed length or varying delimiters.
  49
  50
  51 The initial n characters of a subfield are used to identify the subfield.
  52 Z39.2 permits any (small) fixed value for n, including 0, i.e. not identified.
  53 The MARC family of standards uses n=1.
  54 OpenIsis allows for any value, including variable length identifiers,
  55 which are themselves delimited by some character like a '=' (see below).
  56
  57
  58 Z39.2 states that if identifiers are used, each must be preceded
  59 by a delimiter, and every data element, including the first,
  60 must be identified that way. However, an initial range of m characters
  61 (i.e. preceding the first delimiter) in every field may serve as "indicator",
  62 which is not regarded as a "data element". Again, m is a small fixed number;
  63 MARC uses m=2. Traditional Isis has no special support for indicators.
  64 OpenIsis allows access to whatever is before the first delimiter.
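The interplay of delimiter, identifier length n and the leading part before the first delimiter can be sketched in a few lines of Python. The function name and its parameters are hypothetical; this is an application-level illustration, not the OpenIsis API.

```python
def split_subfields(value, delim="^", idlen=1):
    """Split a field value into (head, [(identifier, data), ...]).

    `head` is whatever precedes the first delimiter (e.g. MARC indicators).
    Each delimited piece starts with `idlen` identifier characters;
    idlen=0 means the subfields are not identified.
    """
    head, *pieces = value.split(delim)
    subfields = [(piece[:idlen], piece[idlen:]) for piece in pieces]
    return head, subfields
```

A traditional Isis value would be split with the defaults, a MARC style value with `delim="\x1f"`, and a semicolon separated list with `delim=";"` and `idlen=0`.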
  65
  66
  67 Different subfielding methods can be mixed or nested.
  68 Typical cases are:
  69 -       mixed fixed/delimited
  70         After some initial fixed subfields, the following subfields are delimited.
  71         This can be used to describe MARC's fixed indicators.
  72 -       nested delimited/fixed
  73         A delimited subfield has itself a fixed substructure.
  74         Actually the leading identifier in a subfield can be regarded
  75         as fixed part in a mixed substructure.
  76 -       nested unidentified delimited
  77         A delimited subfield has itself a delimited structure.
  78         This can be used to model variable length identifiers.
  79
  80 In other words, identifiers are themselves nothing but subfields
  81 used as keys on some level of nesting.
  82 On the other hand, any subfield could serve as a key for its parent.
  83 This is used e.g. to select a field by a subfield indicating a language
  84 (see below for keyed subrecords).
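As an illustration of nested delimiting, the sketch below treats variable length identifiers as an inner '=' delimited subfield and then uses one subfield as a key to pick a field. Both function names and the subfield names `lang` and `text` are assumptions for the example; a repeated identifier would simply overwrite the earlier value here.

```python
def split_keyed(value, delim="^", keysep="="):
    """Split '^lang=en^text=Water' into {'lang': 'en', 'text': 'Water'}."""
    pieces = value.split(delim)[1:]  # drop whatever precedes the first delimiter
    result = {}
    for piece in pieces:
        key, _, data = piece.partition(keysep)  # identifier ends at the first keysep
        result[key] = data
    return result


def select_field(values, key, wanted):
    """Return the first field value whose subfield `key` equals `wanted`, else None."""
    for value in values:
        if split_keyed(value).get(key) == wanted:
            return value
    return None
```

Given several occurrences of a title field, `select_field(titles, "lang", "de")` would pick the German one.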
  85
  86 If you look at the
  87 >       Serialized      plaintext representation of an Isis record,
  88 actually the whole record is a newline delimited value,
  89 the whole database is a blankline (double newline) delimited value
  90 and each field has its tag as initial tab-delimited subfield.
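Read that way, a plaintext dump can be taken apart with three nested splits. The sketch below assumes that values contain no raw newlines (any escaping used by the real serialized format is ignored) and keeps tags as strings; the function name is made up for the example.

```python
def parse_database(text):
    """Split a plaintext dump into records, fields and (tag, value) pairs."""
    records = []
    for block in text.split("\n\n"):            # blank line ends a record
        fields = []
        for line in block.split("\n"):          # newline ends a field
            if not line:
                continue                        # tolerate stray empty lines
            tag, _, value = line.partition("\t")  # tag is the first tab-delimited subfield
            fields.append((tag, value))
        if fields:
            records.append(fields)
    return records
```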
  91
  92
  93 In the future, OpenIsis will add support for a wide variety of
  94 subfielding techniques such as defined by regular expressions,
  95 MIME headers or produced in typical "character/comma separated values" files
  96 (optionally using quotes).
  97
  98 Since splitting subfields is mostly done, and can always be done, at the
  99 application level (i.e. a database server rarely needs to care),
 100 "support" essentially boils down to the definition of appropriate meta data.
 101
 102
 103 *       subrecords
 104
 105 A subrecord consists of a typically continuous range of fields within a record,
 106 started by some field to introduce the subrecord.
 107 Some variants, however, like keyed subrecords,
 108 can be freely scattered and don't need a "header" field.
 109
 110
 111 There are basically four ways to denote the boundaries of structures:
 112 -       embraced
 113         where a special field is used to denote the structure's end.
 114         This resembles SGML-style notations,
 115         where each opening tag is matched by a closing tag.
 116         This is relatively easy and recommended for every day use.
 117 -       marked
 118         where the fields of the child structure are marked as such.
 119         This is sort of the opposite approach of embracing.
 120         Marking comes in several powerful flavours,
 121         see below for a more detailed discussion.
 122 -       counted
 123         where the number of fields (not children) belonging to the
 124         structure is given in (any leading digits of) the initial field.
 125         This allows for safe embedding regardless of the
 126         structure's contents and is thus used in contexts where
 127         full generality is needed like when embedding result records
 128         within a server's response.
 129 -       implicit
 130         where the number of children is fixed.
 131         An example of this is the parse tree of a query,
 132         where the structure "AND" has exactly two children
 133         (which in turn might be structures).
 134         This is used mostly for internal structures like parsed
 135         queries or formats, which are not meant to be exchanged.
 136
 137 The field introducing a subrecord might have any subfields
 138 just like other fields, similar to the attributes that might
 139 be assigned to a tag in SGML applications like HTML.
 140
 141 However, the first subfield (unidentified initial characters)
 142 of a field opening an embraced or counted subrecord is reserved as indicator:
 143 -       a plus sign '+' as first character
 144         indicates explicitly opening a subrecord
 145 -       a minus sign '-' as first character
 146         indicates an empty subrecord (containing no children)
 147 -       an empty value
 148         indicates explicitly closing a subrecord
 149         (similar to the closing blank line used in several protocols)
 150 -       an initial numeric value
 151         (of decimal digits) gives the number of fields to follow.
 152 -       an initial character @A-Z
 153         gives the number of children to follow (@=0,A=1,B=2...) (rarely used)
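The indicator rules above can be summarized as a small classifier. This is a sketch under assumptions: the function name and the returned labels are invented, the indicator is taken to end at the first '^' or tab, and the result is only meaningful for a field already known to introduce an embraced or counted subrecord.

```python
import re


def classify_indicator(value):
    """Classify the indicator of a field that introduces a subrecord.

    Returns (kind, count) where kind is one of
    'open', 'empty', 'close', 'fields', 'children' or 'unknown'.
    """
    head = re.split(r"[\^\t]", value, maxsplit=1)[0]  # unidentified initial characters
    if head == "":
        return ("close", None)
    if head[0] == "+":
        return ("open", None)
    if head[0] == "-":
        return ("empty", None)
    digits = re.match(r"[0-9]+", head)
    if digits:
        return ("fields", int(digits.group()))       # number of fields to follow
    if "@" <= head[0] <= "Z":                         # '@', 'A'..'Z' are contiguous in ASCII
        return ("children", ord(head[0]) - ord("@"))  # @=0, A=1, B=2, ...
    return ("unknown", None)
```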
 154
 155 Auxiliary information about the child,
 156 like an embedded record's row number and type,
 157 is stored in subfields of the parent.
 158
 159
 160 *       conventions
 161
 162 While the intended usage of subrecords might be specified in
 163 more detail in the
 164 >       Meta    table metadata
 165 , the schema can also be used standalone (without referring to metadata),
 166 if some conventions on tag ranges are followed.
 167
 168 The extent of subrecords by length or braces can be safely
 169 determined if you just know that you want the given field
 170 to be regarded as subrecord.
 171
 172 For subrecords with a fixed number of children (meant for internal use),
 173 it is necessary to recognize whether a following field is itself a structure.
 174 If they are used at all, the tag range -1..-99 should be reserved for this
 175 purpose.
 176
 177 In this context, typically one of two modes is used:
 178 -       the MIME processing mode, for processing list-style content,
 179         assumes that negative tags denote structures,
 180         while positive ones contain plain data.
 181 -       in XML processing mode, everything but the 0 tag (text node) is a structure.
 182
 183 If a parent has a subfield ^0,
 184 that should contain the child's identity as dbname or mfn or dbname.mfn.
 185 If the parent's indicator is delimited by a tab instead of a ^,
 186 the next tab-delimited subfield is interpreted that way (where applicable).
 187
 188
 189 *       marked structures
 190
 191 There is a wide variety of techniques for marking fields as "children"
 192 of other fields. Marking techniques work especially well for a single
 193 level of substructuring; for nested structures, some restrictions apply.
 194
 195 We give some commonly used examples:
 196
 197 -       quoting
 198         is done by prefixing every child field value with a special string,
 199         which is not used as prefix outside the child fields.
 200         However, at least for a single level of quoting, it does not pose
 201         a problem if the child fields themselves start with the same prefix:
 202         the original value is still retrieved by stripping the (first) prefix.
 203         This even works for multiple levels, as long as the record was properly
 204         constructed, i.e. the quoting prefix is not used outside children.
 205         Examples are the output of the diff command (which is driving the
 206         RCS/CVS revision control system very reliably) and the '>' quoting
 207         used in e-mail replies.
 208 -       tagging
 209         Instead of the field value, the field tag can of course also be used
 210         as child mark. In some situations it might be possible to choose
 211         appropriate reserved tags for the children.
 212         In other situations, where some given child tag must be kept,
 213         it can be stored as prefix in the field value according to the canonical
 214 >       Serialized
 215         plain text format.
 216 -       keying
 217         If the mark used is dependent on an attribute of the parent field,
 218         the children can be determined even if non-continuous.
 219         With some more cooperation of the children, the mark might be an
 220         attribute (subfield) instead of a prefix (indicator).
 221         That way, children and parents are linked together logically rather
 222         than "physically", by a common key just like in relational databases.
 223         This easily extends to multiple levels using segmented keys
 224         (consisting of several attributes/subfields).
 225         While this scheme only works with well behaved children and may waste
 226         some space by replicating keys, it is simple and robust and gives
 227         convenient access to the children without inspecting the structure.
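The quoting technique is perhaps the easiest to demonstrate. The following Python sketch uses '>' as the quoting prefix, as in e-mail replies; the function names are assumptions for illustration. Stripping only the first prefix is what makes nested quoting work.

```python
def quote(values, prefix=">"):
    """Mark every value as a child by prepending the quoting prefix."""
    return [prefix + value for value in values]


def split_quoted(values, prefix=">"):
    """Separate a record's own values from its quoted children.

    Only the first prefix is stripped, so children that were themselves
    quoted (a deeper level) keep their remaining prefixes.
    """
    own, children = [], []
    for value in values:
        if value.startswith(prefix):
            children.append(value[len(prefix):])
        else:
            own.append(value)
    return own, children
```

Applying `split_quoted` again to the returned children descends one more level.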
 228
 229
 230 *design children vs. attributes
 231
 232 Any information that can be represented using an attribute
 233 can also be represented using a child.
 234 From that point of view, attributes are a redundant "language" construct
 235 and one might deem a model using only children as the simpler one.
 236 We call such an attributeless model "canonical verbose" representation.
 237 It's a little bit similar to the "everything is an object"
 238 approach of pure OO languages like Smalltalk.
 239
 240
 241 But then, having a richer language isn't always such a bad thing,
 242 if you know how to use it appropriately.
 243 (This "if" is the core of almost any serious criticism of rich languages,
 244 but for now, let's assume we know what we're doing).
 245 Appropriate use basically boils down to choosing the language construct
 246 that was just made for your situation, i.e. not the most general one,
 247 but quite to the opposite the most specific (restricted) one.
 248 That way you will not only have the most efficient representation,
 249 but also express additional information about what's going on.
 250
 251
 252 In short, a "canonical compact" modelling can be based upon the principle
 253 "Use attributes wherever possible".
 254
 255 Some logical property of a logical structure can be represented by
 256 means of attributes, if
 257 -       it is simple,
 258         i.e. one single string value.
 259 -       or at least flat,
 260         i.e. itself a structure that can be represented based on attributes
 261         that do not interfere with the parent's attributes.
 262         In the latter case, the property will show up as several
 263         logically interrelated attributes of the parent.
 264         However, such a flat group of attributes might be a candidate
 265         for a child under some circumstances.
 266 -       it is not repeatable.
 267         Although OpenIsis supports repeated subfields as used by some MARCs,
 268         XML/SGML attributes cannot be repeated.
 269         (Technically, they can, but there are neither defined semantics for
 270         repeated attributes nor is access supported by parsers or the DOM).
 271         Moreover, traditional CDS/ISIS implementations do not support
 272         repeated subfields, so it's probably a good idea to not use them
 273         without a pretty good reason.
 274
 275 Basically, when you think C, one field's attributes take everything
 276 that goes into a simple struct, without using arrays or pointers.
 277
 278
 279 The detailed modelling should also take into account the intended usage.
 280
 281 For example, one might turn some attribute candidates into children, if
 282 -       they are likely to be accessed or modified together
 283         but independent of other properties
 284 -       they are candidates to be inherited or overridden as a group in a
 285 >       PatchWork
 286 -       the parent would otherwise become very large
 287
 288
 289 *variants       variant structures
 290
 291 The C language construct of a "union" is frequently used in bibliographic
 292 databases. The typical form resembles the PASCAL "variant record",
 293 using an initial field as indicator for the usage of the given field.
 294 Sometimes, however, the more liberal C practice is used,
 295 where the intended interpretation is specified somewhere in the record, somehow.
 296
 297 A similar construct is used in ALGOL-derived OO languages like C++ or Java,
 298 where the indicator (what kind of object is this?) is out-of-band data
 299 (i.e. cannot be modified or inspected like any other data).
 300
 301
 302 In Isis records, fields always have a tag
 303 (and subfields commonly have an identifier) indicating the kind of data.
 304 Therefore, there is little need to introduce another level of switches.
 305 A canonically decomposed model
 306 -       would not reuse fields or subfields with different structure
 307 -       would not contain rules like
 308         "if subfield a has value b then subfield c must be present"
 309
 310 However, on the other hand, full decomposition might be tedious and
 311 even hide relationships. Moreover, from a given point of view,
 312 tags and identifiers are just ordinary subfields on some level.
 313
 314
 315 In general, if the same tag is used for variants of a field,
 316 the risk of misinterpretation of data should be minimized by
 317 not reusing the same subfields with different structure.
 318 After all, defining another indicator and ignoring an unexpected subfield
 319 or complaining about the lack of an expected one is cheaper, more robust and clearer
 320 than verifying an expected structure based on other subfield values.
 321
 322
 323 *       examples
 324
 325 A typical HTML table definition starting with
 326 $
 327 <table width="100%" cellpadding="0" cellspacing="0"
 328   marginwidth="0" marginheight="0" topmargin="0" leftmargin="0" border="0">
 329 <tr>
 330 <td valign="top" width="160">
 331 this is the textbody <br/> of the td node
 332 </td>
 333 </tr>
 334 ...
 335 $
 336 will be compacted to, say,
 337 $
 338 100     +^w100%^p0^s0^m0^h0^t0^l0^b0
 339 101     +
 340 102     +^vtop^w160
 341 0       this is the textbody
 342 103     -
 343 0       of the td node
 344 102
 345 101
 346 ...
 347 $
 348 For a detailed description of the transformation, see
 349 >       xmlisis the XML-ISIS documentation
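Reading such an embraced listing back into a tree needs little more than a stack. The sketch below follows the XML processing mode (tag 0 is text, everything else is a structure) and the '+', '-' and empty indicators; the function name and the dictionary layout of a node are assumptions, and the attribute subfields are kept as an unparsed string.

```python
def parse_embraced(fields):
    """Turn a list of (tag, value) pairs into a nested list of nodes."""
    root = {"tag": None, "attrs": "", "children": []}
    stack = [root]
    for tag, value in fields:
        if tag == 0:                                   # text node
            stack[-1]["children"].append(value)
        elif value.startswith("+"):                    # explicitly opened subrecord
            node = {"tag": tag, "attrs": value[1:], "children": []}
            stack[-1]["children"].append(node)
            stack.append(node)
        elif value.startswith("-"):                    # empty subrecord, nothing to close
            stack[-1]["children"].append(
                {"tag": tag, "attrs": value[1:], "children": []})
        elif value == "":                              # closing field
            if len(stack) == 1 or stack[-1]["tag"] != tag:
                raise ValueError("unexpected closing field %r" % (tag,))
            stack.pop()
        else:
            raise ValueError("field %r has no valid indicator" % (tag,))
    if len(stack) != 1:
        raise ValueError("unclosed subrecord %r" % (stack[-1]["tag"],))
    return root["children"]
```

Fed with the fields of the table example (plus a closing field for tag 100), this yields one node for the table, containing the row, containing the cell with its two text nodes and the empty node 103 between them.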
 350
 351 A six field result record might be embedded within a response like
 352 $
 353 908     6       cds.47
 354 24      Hydrological achievements and social problems
 355 ...
 356 $
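Extracting a counted subrecord amounts to reading the leading digits of the introducing field and taking that many following fields, whatever they contain. A minimal sketch, with a hypothetical function name and fields given as (tag, value) pairs:

```python
import re


def take_counted(fields, start):
    """Return (subrecord_fields, next_index) for the counted subrecord at `start`."""
    _tag, value = fields[start]
    digits = re.match(r"[0-9]+", value)
    if digits is None:
        raise ValueError("field does not start with a field count")
    count = int(digits.group())
    end = start + 1 + count
    if end > len(fields):
        raise ValueError("subrecord is truncated")
    return fields[start + 1:end], end
```

Because the contents are never inspected, embedded fields may themselves look like subrecord headers without confusing the reader of the enclosing record.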
 357
 358 Assuming we gave tag -20 to "OR" (and 0 to a literal),
 359 the query "plant OR water" might be parsed to
 360 $
 361 -20     B
 362 0       plant
 363 0       water
 364 $
 365
 366 If -21 is assigned to "AND", "frog AND (plant OR water)" might look like
 367 $
 368 -21     B
 369 0       frog
 370 -20     B
 371 0       plant
 372 0       water
 373 $
 374
 375 For implicit tags, the number of children is redundant
 376 (fixed per tag in a given use) and will typically be omitted.
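With the number of children fixed per tag, such a listing can be read back by a simple recursive descent that ignores the field values of the operators. The tag assignments -20 and -21 are those of the examples above; the function name, the tables and the tuple representation of the tree are assumptions made for this sketch.

```python
ARITY = {-20: 2, -21: 2}          # children per structure tag (implicit)
NAMES = {-20: "OR", -21: "AND"}


def parse_query(fields, pos=0):
    """Parse the (tag, value) pairs starting at `pos`; return (tree, next_pos)."""
    tag, value = fields[pos]
    pos += 1
    if tag not in ARITY:          # tag 0: a literal term
        return value, pos
    children = []
    for _ in range(ARITY[tag]):   # each child may itself be a structure
        child, pos = parse_query(fields, pos)
        children.append(child)
    return (NAMES[tag], *children), pos
```

For the second example this yields the tree `("AND", "frog", ("OR", "plant", "water"))`.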
 377
 378 ---
 379         $Id: Struct.txt,v 1.8 2003/06/23 14:44:29 kripke Exp $