=head1 WebPAC - Search engine or data-warehouse manual

It's quite hard to explain concisely what WebPAC is. It's a mix between a
search engine and a data warehousing application. Let's look at that in detail...

WebPAC was originally written to search CDS/ISIS records using C<swish-e>.
Since then, however, it has adopted various other input formats and added
support for alphabetical lists (earlier described as indexes).

As this concept evolved, we decided on the following work-flow:

 source file    CDS/ISIS, MARC, Excel, robots, ...
    1 |         apply input normalisation rules (xml or yaml)
 intermediate   this data is re-formatted source data converted
 data           to chunks based on tag names from config/input/
    2 |         optionally apply output filter (TT2)
 data           search engine, HTML, OAI, RDBMS
    3 |         filter using query in REST format
    4 |         apply output filter (TT2)
 client         Web browser, SOAP

=head2 Normalisation and Intermediate data

This is the first step in working with your data.

You are creating one-to-one mappings from source data records to documents
in WebPAC. You can split or merge data from input records, apply filters
(perl subroutines), use lookups within the same source file, or do simple
evaluations while producing output.

All that is controlled by a C<config/input/> configuration file. You
will want to create fine-grained chunks of data (like separate first and
last names), which will later be used to produce output. You can think of
the conversion process as application of a C<config/input/> recipe to each
source record.

Each tag within the recipe creates one new record as long as there are
fields in the input format (which can be repeatable) that satisfy at least one
of its mappings.

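For illustration only, a normalisation recipe might look something like the
following sketch. This is a hypothetical example, not a verbatim
C<config/input/> file: the exact syntax, tag names (C<personal_name>,
C<year>) and field designators (C<700a>, C<210c>) are all assumed here and
will differ in a real configuration.

 # hypothetical config/input/ recipe (YAML-style sketch)
 tags:
   personal_name:
     # merge repeatable subfields a (last name) and b (first name)
     # from field 700 of each source record
     - 700a
     - 700b
   year:
     # single chunk taken from field 210, subfield c
     - 210c

Each key under C<tags> would become one fine-grained chunk of intermediate
data, produced once for every matching (possibly repeated) source field.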
Users of older WebPAC versions should note that this file no longer contains
any formatting or specification of output type, and that the granularity of
each tag has changed.

B<This document should really be updated to reflect the Webpacus front-end.>

=head2 Producing output

Now that we have a normalized record, we can create some output. You can create
HTML from it, data files for a search engine, or insert records into an RDBMS.

The twist is that application of output filters can be recursive, allowing
you to query data generated in a previous step. This enables you to represent
lists or trees from source data that has structure. It also requires you to
produce structured data in step 2, which can be filtered and queried in steps
3 and 4 to produce the final output.

Note that in step 4 you can also query the intermediate data, not
just the data produced in step 2.

Output filters use Template Toolkit 2, so you have the full power of a simple
procedural language (loops, conditions) and handy built-in functions to
produce your output.

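As an illustration of such an output filter, a minimal TT2 template for
step 2 might look like the sketch below. The variable C<records> and the
field names C<personal_name> and C<year> are assumed for the example and
are not taken from a real WebPAC configuration.

 [%# hypothetical TT2 output filter producing an HTML list %]
 <ul>
 [% FOREACH record IN records %]
 <li>
   [% record.personal_name %]
   [% IF record.year %]([% record.year %])[% END %]
 </li>
 [% END %]
 </ul>

The same mechanism, with a template emitting structured data instead of
HTML, is what makes the recursive filtering in steps 3 and 4 possible.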
=head2 REST Query Format

The design decision was to use a REST query format. This has the benefit of
simplicity and the ability to create unique URLs for all content within
WebPAC. A simple query might look like this:

  http://webpac/search/html/personal_name/Joe%20Doe/AND/year/LT%201995

This REST query can be broken down to:

=over

=item http://webpac/

Hostname on which the service is running. Not required if doing lookups.

=item search

Name of the output filtering method. This will specify the search engine.

=item html

Specifies the template that will be used to produce output.

=item personal_name/Joe%20Doe...

URL-encoded query string. It is specific to the filtering method used.

=back

You can easily produce an RSS feed for the same query using the following
REST URL:

  http://webpac/search/rss/personal_name/Joe%20Doe/AND/year/LT%201995

Yes, it really is that simple. As it should be.

=head1 Technical stuff

The following text is more hard-core technical stuff about how WebPAC is
implemented.

=head2 Search engine

We are using the Hyper Estraier search engine via the pgestraier PostgreSQL
bindings.

It should be relatively easy to plug in another one if the need arises.

=head2 Data Warehouse

In a nutshell, WebPAC has evolved to support hybrid data as input. That
means it has become a kind of data-warehouse application. It doesn't
directly support roll-up and roll-down operations, but they can be emulated
in the intermediate data step or the output step.

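For example, a simple roll-up by year could be emulated in the output step
with a TT2 filter along these lines. This is again only a sketch: the
C<records> variable and the C<year> field are assumed names, not part of a
real configuration.

 [%# hypothetical TT2 sketch: count records per year (roll-up) %]
 [% SET count = {} %]
 [% FOREACH record IN records %]
   [% SET y = record.year %]
   [% SET count.$y = (count.$y || 0) + 1 %]
 [% END %]
 [% FOREACH y IN count.keys.sort %]
   [% y %]: [% count.$y %]
 [% END %]

A roll-down would work the other way around: a template that re-emits the
fine-grained intermediate chunks for a selected aggregate.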