=head1 WebPAC - Search engine or data-warehouse manual

It is quite hard to explain concisely what WebPAC is. It is a mix between a
search engine and a data warehousing application. Let's look at that in detail...

WebPAC was originally written to search CDS/ISIS records using C. Since then
it has adopted several other input formats and added support for alphabetical
lists (earlier described as indexes).

As this concept evolved, we decided on the following work-flow for your data:

  step
        source data         CDS/ISIS, MARC, Excel, robots, ...
            |
   0        | apply lookup rules (optional)
   1        | apply input normalisation rules (xml or yaml)
            V
        intermediate        this data is re-formatted source data converted
        data                to chunks based on tag names from config/input/
            |
   2        | optionally apply output filter (TT2)
            V
        data                search engine, HTML, OAI, RDBMS
            |
   3        | filter using query in REST format
   4        | apply output filter (TT2)
            V
        client              Web browser (html), JSON

=head2 Source data

WebPAC supports various input formats:

=over 2

=item L CDS/ISIS data

=item L for MARC records

=item L Microsoft Excel C<.xls> support

=item L support legacy tables (e.g. Clipper)

=item L for RDF catalog data from Project Gutenberg

=back

=head2 Create data lookups

Before you can begin normalisation, you might want to create lookups which
store C<< key -> value(s) >> pair(s).

Lookups are especially useful if you want to I lookup the value of some other
record using some sort of identifier.

Lookups are described in more detail in L.

=head2 Normalisation to intermediate data

Intermediate data is the internal representation of data on which WebPAC
operates. You create mappings, one-to-one, from source data records to
documents in WebPAC.

You can split or merge data from input records, apply regexes, use lookups
within the same source file, and apply conditions, branches and/or simple
evaluations while producing intermediate data.

All that is controlled with C configuration file.
This file is in human-readable YAML format, and it describes the whole
configuration of WebPAC and its front-end Webpacus.

All that is controlled with C configuration files.

You will want to create fine-grained chunks of data (like separate first and
last names), which will later be used to produce output. You can think of the
conversion process as the application of C recipe on every input record.

Each tag within the recipe creates one new record as long as there are fields
in the input format (which can be repeatable) that satisfy at least one field
within the tag.

Users of older webpac should note that this file no longer contains any
formatting or specification of output type, and that the granularity of each
tag has increased.

B

=head2 Output filter

Now that we have a normalized record, we can create some output. You can
create HTML from it, data files for a search engine, or insert it into an
RDBMS.

The twist is that the application of output filters can be recursive,
allowing you to query data generated in the previous step. This enables you to
represent lists or trees from source data that has structure. It also requires
you to produce structured data in step 2 which can be filtered and queried in
steps 3 and 4 to produce the final output.

You should note that you can also query intermediate data in step 4, not just
data produced in step 2.

Output filters use Template Toolkit 2, so you have the full power of a simple
procedural language (loops, conditions) and handy built-in functions to
produce output.

=head2 REST Query Format

The design decision is to use a REST query format. This has the benefit of
simplicity and the ability to create unique URLs for all content within
webpac.

A simple query looks like this:

  http://webpac/search/html/personal_name/Joe%20Doe/AND/year/LT%201995

This REST query can be broken down to:

=over

=item http://webpac

Hostname on which the service is running. Not required if doing lookups, just
for browser usage.

=item search

Name of the output filtering method. This will specify the search engine.

=item html

Specifies the template that will be used to produce output.

=item personal_name/Joe%20Doe...

URL-encoded query string. It is specific to the filtering method used.

=back

You can easily produce an RSS feed for the same query using the following
REST URL:

  http://webpac/search/rss/personal_name/Joe%20Doe/AND/year/LT%201995

Yes, it really is that simple. As it should be.

=head1 Technical stuff

The following text is more hard-core technical material about how webpac is
implemented and why.

=head2 Search Engine

We are using the Hyper Estraier search engine through pgestraier, the
PostgreSQL bindings for it. It should be relatively easy to plug in another
one if the need arises.

=head2 Data Warehouse

In a nutshell, webpac has evolved to support hybrid data as input. That means
it has become a kind of data-warehouse application. It doesn't directly
support roll-up and roll-down operations, but they can be emulated using the
intermediate data step or the output step.
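
As an illustration of the REST query format described earlier, the breakdown
of a URL into hostname, filtering method, template and query string can be
sketched in a few lines of Python. This is only a minimal sketch and not how
webpac itself parses requests: the function name C<parse_rest_query> and the
keys of the returned dictionary are assumptions made for this example, and the
query terms are returned as a plain list because their meaning is specific to
the filtering method.

```python
from urllib.parse import urlsplit, unquote


def parse_rest_query(url):
    """Split a webpac-style REST URL into the parts named in the manual.

    Hypothetical helper for illustration; not part of webpac.
    """
    parts = urlsplit(url)
    # Split the path first and decode each segment afterwards, so that an
    # encoded slash inside a value would not be mistaken for a separator.
    segments = [unquote(s) for s in parts.path.split("/") if s]
    if len(segments) < 2:
        raise ValueError("expected at least /<method>/<template>")
    return {
        "host": parts.netloc,      # e.g. "webpac"
        "method": segments[0],     # output filtering method, e.g. "search"
        "template": segments[1],   # output template, e.g. "html" or "rss"
        "query": segments[2:],     # decoded terms, specific to the method
    }


parsed = parse_rest_query(
    "http://webpac/search/html/personal_name/Joe%20Doe/AND/year/LT%201995"
)
rss = parse_rest_query(
    "http://webpac/search/rss/personal_name/Joe%20Doe/AND/year/LT%201995"
)
```

In this sketch the HTML and RSS variants of the same query differ only in the
C<template> entry, which mirrors the point that switching output formats means
changing a single path segment.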