=head1 WebPAC - Search engine or data-warehouse manual

It is hard to explain concisely what WebPAC is. It is a mix between a
search engine and a data-warehousing application. Let's look at that in detail.

WebPAC was originally written to search CDS/ISIS records using C<swish-e>.
Since then it has adopted several other input formats and added
support for alphabetical lists (earlier described as indexes).

As this concept evolved, we settled on the following workflow
for your data:

  step

   source data      CDS/ISIS, MARC, Excel, robots, ...
      |
  0   | apply lookup rules (optional)
  1   | apply input normalisation rules (xml or yaml)
      V
   intermediate     this data is re-formatted source data converted
     data           to chunks based on tag names from config/input/
      |
  2   | optionally apply output filter (TT2)
      V
     data           search engine, HTML, OAI, RDBMS
      |
  3   | filter using query in REST format
  4   | apply output filter (TT2)
      V
    client          Web browser (html), JSON

=head2 Source data

WebPAC supports various input formats:

=over 2

=item L<WebPAC::Input::ISIS> for CDS/ISIS data

=item L<WebPAC::Input::MARC> for MARC records

=item L<WebPAC::Input::Excel> for Microsoft Excel C<.xls> files

=item L<WebPAC::Input::DBF> for legacy DBF tables (e.g. Clipper)

=item L<WebPAC::Input::Gutenberg> for RDF catalog data from Project Gutenberg

=back

=head2 Create data lookups

Before you can begin normalisation, you might want to create lookups which store
C<< key -> value(s) >> pair(s). Lookups are especially useful if you want to,
I<well>, look up a value from some other record using some sort of identifier.

Lookups are described in more detail in L<WebPAC::Lookup>.

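As a rough illustration of the C<< key -> value(s) >> idea, here is a minimal
Python sketch. It is not the L<WebPAC::Lookup> API; the function
C<build_lookup> and the field names C<id>, C<parent_id> and C<title> are
assumptions made up for this example.

```python
from collections import defaultdict


def build_lookup(records, key_field, value_field):
    """Map each key to the list of values seen for it (key -> value(s))."""
    lookup = defaultdict(list)
    for record in records:
        key = record.get(key_field)
        value = record.get(value_field)
        # Records that lack either field contribute nothing to the lookup.
        if key is not None and value is not None:
            lookup[key].append(value)
    return dict(lookup)


# Hypothetical records; the field names are invented for illustration.
records = [
    {"id": "1", "title": "Collected works"},
    {"id": "2", "parent_id": "1", "title": "Volume one"},
    {"id": "3", "parent_id": "1", "title": "Volume two"},
]

# Look up a title by identifier, or all children of a parent record.
titles_by_id = build_lookup(records, "id", "title")
children_of = build_lookup(records, "parent_id", "id")
```

One key may map to several values, which is why every key holds a list.
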
=head2 Normalisation to intermediate data

Intermediate data is the internal representation of data on which WebPAC
operates.

You create one-to-one mappings from source data records to documents
in WebPAC. You can split or merge data from input records, apply regexes,
use lookups within the same source file, and use conditions, branches and/or
simple evaluations while producing intermediate data.

All of that is configured through the C<config/config.yml> file. It is in
human-readable YAML format and describes the whole configuration of
WebPAC and its front-end, Webpacus.

The normalisation itself is controlled by the configuration files in
C<config/input/>. You will want to create fine-grained chunks of data (such
as separate first and last names), which will later be used to produce
output. You can think of the conversion process as the application of a
C<config/input/> recipe to every input record.

Each tag within the recipe creates a new record for as long as there are
fields in the input format (which can be repeatable) that satisfy at least
one field within the tag.

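The idea of applying a recipe to every input record can be sketched in
Python as follows. This is only an illustration, not the real
C<config/input/> format: the C<RECIPE> structure, the function C<normalise>
and the field names C<700a> and C<210d> are all assumptions.

```python
import re

# Hypothetical recipe: tag name -> (input field, regex with one capture group).
RECIPE = {
    "last_name": ("700a", r"^([^,]+),"),
    "first_name": ("700a", r",\s*(.+)$"),
    "year": ("210d", r"(\d{4})"),
}


def normalise(record, recipe):
    """Apply every tag of the recipe to one input record.

    Input fields are lists because they can be repeatable; each
    occurrence that matches contributes one value to the tag.
    """
    document = {}
    for tag, (field, pattern) in recipe.items():
        values = []
        for occurrence in record.get(field, []):
            match = re.search(pattern, occurrence)
            if match:
                values.append(match.group(1))
        # A tag appears in the output only if some input field satisfied it.
        if values:
            document[tag] = values
    return document


# One input record with a repeated name field, split into fine-grained chunks.
record = {"700a": ["Doe, Joe", "Roe, Jane"], "210d": ["Zagreb, 1994"]}
document = normalise(record, RECIPE)
```

The resulting document holds no formatting, only named chunks of data.
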
Users of older versions of WebPAC should note that this file no longer
contains any formatting or specification of output type, and that the
granularity of each tag has increased.

B<This document should really be updated to reflect the Webpacus front-end
from this point on...>

=head2 Output filter

Now that we have a normalised record, we can create some output. You can
create HTML from it, produce data files for a search engine, or insert the
records into an RDBMS.

The twist is that the application of output filters can be recursive, allowing
you to query data generated in a previous step. This enables you to represent
lists or trees from source data that have structure. It also requires you to
produce structured data in step 2, which can then be filtered and queried in
steps 3 and 4 to produce the final output.

Note that you can also query intermediate data in step 4, not
just data produced in step 2.

Output filters use Template Toolkit 2, so you have the full power of a simple
procedural language (loops, conditions) and handy built-in functions to
produce output.

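To make the recursive idea more concrete, here is a small Python sketch that
renders a tree by querying the records again at every level. Real output
filters are Template Toolkit 2 templates, so this only mirrors the logic; the
function C<render_tree> and the fields C<id>, C<parent_id> and C<title> are
assumptions for illustration.

```python
from html import escape


def render_tree(records, parent_id=None):
    """Render records as nested <ul> lists, querying children per level."""
    # "Query" step: pick the records that belong under the current parent.
    children = [r for r in records if r.get("parent_id") == parent_id]
    if not children:
        return ""
    items = []
    for child in children:
        # Recursive step: run the same filter on data selected by this record.
        nested = render_tree(records, child["id"])
        items.append(f"<li>{escape(child['title'])}{nested}</li>")
    return "<ul>" + "".join(items) + "</ul>"


# Hypothetical structured data, as it might come out of step 2.
records = [
    {"id": "1", "title": "Collected works"},
    {"id": "2", "parent_id": "1", "title": "Volume one"},
    {"id": "3", "parent_id": "1", "title": "Volume two"},
]
page = render_tree(records)
```

The structure has to be present in the data (here, C<parent_id>) for the
filter to be able to follow it.
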
=head2 REST Query Format

We made the design decision to use a REST query format. This has the benefit
of simplicity and the ability to create unique URLs for all content within
WebPAC. A simple query looks like this:

  http://webpac/search/html/personal_name/Joe%20Doe/AND/year/LT%201995

This REST query can be broken down into:

=over

=item http://webpac

Hostname on which the service is running. It is not required when doing
lookups, only for browser usage.

=item search

Name of the output filtering method. Here it specifies the search engine.

=item html

Template that will be used to produce output.

=item personal_name/Joe%20Doe...

URL-encoded query string. It is specific to the filtering method used.

=back

You can easily produce an RSS feed for the same query using the following
REST URL:

  http://webpac/search/rss/personal_name/Joe%20Doe/AND/year/LT%201995

Yes, it really is that simple. As it should be.

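The breakdown of such a URL can be sketched with Python's standard
C<urllib.parse> module. The helper names C<parse_rest_query> and
C<with_template> are made up for this example and are not part of WebPAC.
The query part is only split and decoded, because its meaning depends on the
filtering method.

```python
from urllib.parse import unquote, urlsplit


def parse_rest_query(url):
    """Split a REST query URL into host, method, template and query parts."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    method, template, *query = segments
    return {
        "host": parts.netloc,
        "method": method,
        "template": template,
        # Decode each segment separately; interpretation is method-specific.
        "query": [unquote(s) for s in query],
    }


def with_template(url, template):
    """Return the same query URL, but rendered through another template."""
    parts = urlsplit(url)
    segments = parts.path.split("/")  # ['', method, template, query...]
    segments[2] = template
    return parts._replace(path="/".join(segments)).geturl()


url = "http://webpac/search/html/personal_name/Joe%20Doe/AND/year/LT%201995"
parsed = parse_rest_query(url)
rss_url = with_template(url, "rss")
```

Switching from HTML to RSS changes a single path segment and leaves the
query untouched.
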
=head1 Technical stuff

The following text is more hard-core technical material about how WebPAC is
implemented and why.

=head2 Search Engine

We use the Hyper Estraier search engine through its pgestraier PostgreSQL
bindings.

It should be relatively easy to plug in another one if the need arises.

=head2 Data Warehouse

In a nutshell, WebPAC has evolved to support hybrid data as input. That
means it has become a kind of data-warehouse application. It doesn't directly
support roll-up and roll-down operations, but they can be emulated using
the intermediate data step or the output step.

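One way to picture that emulation is the Python sketch below, which
aggregates and narrows documents made of fine-grained tags. It is an
assumption about how one might do it, not WebPAC code; the functions
C<roll_up> and C<roll_down> and the tags C<year> and C<language> are
hypothetical.

```python
from collections import Counter


def roll_up(documents, dimension):
    """Count documents per value of one dimension (a roll-up)."""
    counts = Counter()
    for document in documents:
        for value in document.get(dimension, []):
            counts[value] += 1
    return dict(counts)


def roll_down(documents, dimension, value):
    """Keep only documents carrying the value, ready for a finer roll-up."""
    return [d for d in documents if value in d.get(dimension, [])]


# Hypothetical intermediate data: each tag holds a list of values.
documents = [
    {"year": ["1994"], "language": ["hr"]},
    {"year": ["1994"], "language": ["en"]},
    {"year": ["1995"], "language": ["hr"]},
]

per_year = roll_up(documents, "year")
languages_1994 = roll_up(roll_down(documents, "year", "1994"), "language")
```

Both operations work only because normalisation already split the source
data into separate tags.
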