Technical Documentation for Darwin Online

v1.1 January 17th 2007

Prepared by: Antranig Basman (amb26@ponder.org.uk)

About this Document:

This document describes the technical solution adopted by Darwin Online, aimed primarily at those seeking to adapt the technology for similar digitization projects, as well as future maintainers of Darwin Online itself.

The exposition will begin at the "back end", being the central bibliographic database defining the overall record structure of the entire project, following its fate through processing by the Java/Lucene indexing process, integration with the full text corpus, to final rendering by the front end.

This document should be read in conjunction with two others: "Date Encoding Standard for Darwin Online" (current version 1.2 of January 17th 2006), referred to below as DES, and "Content Markup Standard for Darwin Online" (current version 1.1 of October 11th 2005), referred to below as CMS.

Overall Structure:

Back End:

The bibliographic database, which defines the "spine" of the record system, is managed under a simple schema in a Microsoft Access database, under the personal control of the Project Director, Dr John van Wyhe. Periodically, the database contents are exported in CSV format, parsed by Java code in the server’s back end, and indexed into a database managed by Lucene (http://lucene.apache.org/), a popular open source Java search engine. During this indexing procedure, the records are merged with the full text/digitised image corpus, which is separately maintained in a free-form structure in the local filesystem. Database records and corpus entries are matched by a naming convention placed on corpus filenames, together with the markup standards defined in CMS. The free text portion of the corpus is held in XHTML documents with an additional light encoding specified by the CMS.

Front End:

The front end of the system renders both bibliographic records and corpus elements to the user, as well as serving as the interface to the search engine. The front end is implemented as a Java Servlet using the RSF presentation technology (http://www.caret.cam.ac.uk/rsfwiki).

Corpus entries/items are presented in a frames-based interface, which is capable of 3 modes of operation – i) a paged XHTML representation of the free text of the item (where this is held), ii) a paged image representation of the item, or iii) an interface presenting a synchronized representation of the free text and image versions of each page, where these are both held.

The search engine is presented in a set of views which allow separate access to the manuscript collection, as distinct from Darwin’s published works (although these two views are in fact handled by the same RSF view). These views allow simultaneous restriction of the search results both by bibliographic field contents and by free text searches in the corpus (where this is held for an item). Hits from the search engine show a highlighted set of keywords in context (KWIC) in the style made familiar by Google and other engines, and may be ordered by date, relevance or ID. Links followed from this result set also show a view of the interface for that item with the keywords highlighted globally. The requirement for this highlighted view is the principal reason that a dynamic rendering strategy was chosen for content, rather than a pre-publication strategy.

Hosting Requirements:

The complete system as currently deployed for darwin-online.org.uk has the following hosting requirements:

J2EE container suitable for hosting Java servlets (in practice, Apache’s Tomcat is typically used)
RDBMS for holding page counts (very light usage is made, very low space/CPU requirements – MySQL is currently used for this)
Adequate filesystem space for holding the free text and digital facsimile corpus, search engine indexes and software – the corpus is currently running on the order of 50Gb.
Front-end server for hosting static content – this was particularly critical in the early launch of the project when the server was receiving extremely heavy usage (500Gb/day) but could probably now be dispensed with. The Apache server is in use.
Moderate external bandwidth – the system is currently consuming around 500Gb/month.

Bibliographic Database:

All data in the database is held in the UTF-8 encoding. The schema forms a single "star" with the central table representing a bibliographic entry, entitled "Item". The current Access schema for Item contains a number of fields which are superfluous or internal to the maintainers – the fields which are interpreted by the Java backend are listed in the following table. Some fields are "synthetic" and deduced by the backend during indexing from external (such as the free text/image corpus) or internal information – this is highlighted in the table.

Note that the bulk of the contents of this table are determined in the Java file ItemFieldRegistry.java, except for the synthetic fields which are generally determined in ItemIndexUpdater.java.

Database Column	Lucene Field	Description
Identifier	itemID, identifier	Primary key for this record
Title	exacttitle	The title physically associated with this item ⊗
AttributedTitle	attributedtitle	The title by which this item is commonly cited or attributed ⊗
ConciseReference	reference	For Freeman entries, a completely rendered reference for citing this item ⊗
LanguageID	language	The language for this item → Language
PlacePublishedID	place	Place of publication → Place
PartOfDocumentID	documenttype	The type of this item → PartOfDocument
PublisherID	publisher	Publisher for this item → Publisher
PeriodicalID	periodical	Periodical in which this item appeared → Periodical
Name	name	Name of a person associated with this item (e.g. author, or sender of a letter) ⊗
Xref	xref	Cross-reference to another system (in general, the Darwin Correspondence)
<Synthesized>	havetext	Whether the corpus contains a digital text representation of the item (determined from the filesystem)
<Synthesized>	haveimages	Whether the corpus contains a facsimile image representation of the item (determined from the filesystem)
<Synthesized>	manuscript	Whether the item falls into the "manuscript" category (determined from field "documenttype")
<Synthesized>	published	Whether the item falls into the "published" category (determined from field "documenttype")
<Synthesized>	startdate	The start of the "date uncertainty range" for this item, determined from the Database column "Date" (see DES)
<Synthesized>	enddate	The end of the "date uncertainty range" for this item, determined from the Database column "Date" (see DES)
<Synthesized>	searchid	A tokenised version of the identifier key, that may be searched with wildcards
<Synthesized>	searchtitle	The "inferred title" for the item, derived from the other fields marked ⊗ (see below)
<Synthesized>	sorttitle	The title to be used when search hits are sorted in title order
DateString	displaydate	The "display date" for the record (see DES)

Notes: i) The fields marked ⊗ participate in the "inferred title" scheme. Due to the mixture of different types of items in the database, the actual title to be used for item display may be found in one of a number of fields. The following fields are searched in order, and the first nonempty entry is chosen for the display title (logic in ItemIndexUpdater.java): Title, AttributedTitle, Description, Name.

ii) The fields marked → are references to foreign keys in subsidiary tables. The subsidiary tables used in the system are Language, PartOfDocument, Periodical, Place and Publisher.
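
The "inferred title" rule in note i) can be illustrated with a minimal sketch. This is not the code from ItemIndexUpdater.java: the class name, the method name and the representation of a record as a map from column name to value are all assumptions made for illustration, as is the choice to treat whitespace-only values as empty.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not the project's ItemIndexUpdater: it only
// illustrates the "first nonempty field wins" rule from note i).
final class InferredTitleSketch {
    // Column names from note i), in the order in which they are searched.
    static final String[] TITLE_COLUMNS = {"Title", "AttributedTitle", "Description", "Name"};

    /** Returns the first non-blank value among the title columns, or "" if none is set. */
    static String inferTitle(Map<String, String> record) {
        for (String column : TITLE_COLUMNS) {
            String value = record.get(column);
            // Assumption: whitespace-only values count as empty.
            if (value != null && !value.trim().isEmpty()) {
                return value.trim();
            }
        }
        return "";
    }

    public static void main(String[] args) {
        // Made-up record with no Title or AttributedTitle.
        Map<String, String> record = new LinkedHashMap<>();
        record.put("Title", "");
        record.put("Name", "Example Sender");
        System.out.println(inferTitle(record));
    }
}
```

For a record such as a letter that carries neither a Title nor an AttributedTitle, the search falls through to Description and then Name.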

Database Export procedure:

The table Item and the 5 subsidiary tables listed in note ii) above are exported from Microsoft Access using the "export as CSV" function. All tables are to be exported with the UTF-8 character set in use (selected from the "Advanced" tab in the export dialog). The table "Item" is to be exported with the column names listed in the first line of the file; the 5 other tables are to be exported without column names (these tables only contain one useful column other than the primary key). These files are placed in the subdirectory "database" of the deployed webapp, each named after its table with the ".txt" extension, e.g. Item.txt, Language.txt.
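
As a rough illustration of how one of the five header-less subsidiary exports might be read back, the sketch below loads key/value rows into a map. The exact row layout produced by Access is an assumption here (a primary key, a comma, then one optionally double-quoted value, with no embedded line breaks), and the class and method names are hypothetical. The project's own parser may differ.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical reader for a two-column lookup export such as Language.txt.
final class LookupTableSketch {
    /** Reads rows of the assumed form  key,"value"  into a map from key to value. */
    static Map<String, String> load(Reader source) throws IOException {
        Map<String, String> table = new LinkedHashMap<>();
        BufferedReader reader = new BufferedReader(source);
        String line;
        while ((line = reader.readLine()) != null) {
            int comma = line.indexOf(',');
            if (comma < 0) {
                continue; // skip blank or malformed lines
            }
            // Assumption: the key itself never contains a comma.
            table.put(unquote(line.substring(0, comma)), unquote(line.substring(comma + 1)));
        }
        return table;
    }

    /** Strips one pair of surrounding double quotes and undoes doubled quotes. */
    static String unquote(String field) {
        String trimmed = field.trim();
        if (trimmed.length() >= 2 && trimmed.startsWith("\"") && trimmed.endsWith("\"")) {
            return trimmed.substring(1, trimmed.length() - 1).replace("\"\"", "\"");
        }
        return trimmed;
    }

    public static void main(String[] args) throws IOException {
        // Made-up rows standing in for the contents of a subsidiary table export.
        Map<String, String> languages = load(new StringReader("1,\"English\"\n2,\"German\"\n"));
        System.out.println(languages);
    }
}
```

To read a real export, the StringReader would be replaced by a reader opened on the file with the UTF-8 charset, matching the character set selected at export time.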

Filesystem corpus

The files forming the digital text and facsimile corpus are stored in a free-form structure in the filesystem, to allow maximum freedom to the transcribing and management team. They are recognised by being placed in the subdirectory "converted" of the webapp’s installation directory. Digitised texts are stored in an XHTML format described in document CMS. The filenames of these XHTML files are entirely unconstrained; the contents of these files are recognised by the extra attribute markup described in CMS.

Naming convention for facsimile images:

Facsimile images may be stored in any image format supported by commonly deployed browsers; the majority are currently stored in .jpg format. These page images are recognised by a convention placed on their filenames. This convention recognises the final parts of filenames (immediately before the file type extension) as delimited by the underscore character _. The final part of the filename thus delimited is taken to be the page sequence number of the image, corresponding with the dar:pageseq attribute defined in CMS. The penultimate part is taken to be the item ID, corresponding with the Identifier primary key in the Item table above, as well as the dar:id attribute described in CMS. This allows a unique correspondence of each page image, digitised text page and bibliographic record, where such a correspondence exists.

In order to mark an image in the corpus tree which should not be considered part of the facsimile set (generally because it is an internal image referenced directly from the XHTML corpus), the final delimited filename segment should begin with the characters "fig".

For example, the following filename:

1851_geology_F328_32.jpg

corresponds to the page with sequence number 32 for the item with ID F328. Note that the filename portion before the penultimate underscore is arbitrary, for convenience of the editors.

The following filename:

1882_cholorophyll_F1801_fig1.jpg

corresponds to an "internal resource" referenced by the text corpus, and will be ignored in a sweep of the filesystem. Any file with an image extension that does not conform to one of the above two cases will be signalled as an error when traversing the filesystem.
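
The naming convention described above can be sketched as a small classifier. The class and its names are hypothetical and are not taken from the Darwin Online code base. The sketch also assumes that page sequence numbers are plain non-negative integers, which the text above suggests but does not state.

```java
// Hypothetical classifier for image filenames in the corpus tree.
final class FacsimileNameSketch {
    enum Kind { PAGE, INTERNAL_FIGURE, INVALID }

    static final class Parsed {
        final Kind kind;
        final String itemId; // null unless kind == PAGE
        final int pageSeq;   // -1 unless kind == PAGE

        Parsed(Kind kind, String itemId, int pageSeq) {
            this.kind = kind;
            this.itemId = itemId;
            this.pageSeq = pageSeq;
        }
    }

    static Parsed parse(String filename) {
        // Drop the file type extension, then split the stem on underscores.
        int dot = filename.lastIndexOf('.');
        String stem = dot < 0 ? filename : filename.substring(0, dot);
        String[] parts = stem.split("_", -1); // -1 keeps trailing empty segments
        String last = parts[parts.length - 1];

        // A final segment beginning with "fig" marks an internal resource.
        if (last.startsWith("fig")) {
            return new Parsed(Kind.INTERNAL_FIGURE, null, -1);
        }
        // Otherwise the final segment must be the page sequence number
        // (assumed numeric) and the penultimate segment the item ID.
        if (parts.length < 2 || !last.matches("\\d{1,9}")) {
            return new Parsed(Kind.INVALID, null, -1);
        }
        String itemId = parts[parts.length - 2];
        if (itemId.isEmpty()) {
            return new Parsed(Kind.INVALID, null, -1);
        }
        return new Parsed(Kind.PAGE, itemId, Integer.parseInt(last));
    }

    public static void main(String[] args) {
        String[] names = {"1851_geology_F328_32.jpg", "1882_cholorophyll_F1801_fig1.jpg", "cover.jpg"};
        for (String name : names) {
            Parsed p = parse(name);
            System.out.println(name + " -> " + p.kind + " " + p.itemId + " " + p.pageSeq);
        }
    }
}
```

Anything before the penultimate underscore is ignored, matching the note that this portion of the filename is arbitrary.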

Filesystem sweep:

The filesystem under "converted" is traversed recursively, looking for all files with the "html" extension, which are expected to conform to the CMS, and files with common image extensions, which are expected to conform to the naming convention above. Any failures are reported in the status page of the application. The status page is the view with name "status", which lists the transcript of errors from the most recent filesystem sweep. A filesystem sweep is always performed on first initialisation of the servlet, and whenever triggered by the button present on this status page (triggering via the button is password-protected, to prevent DoS attacks).
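
A recursive sweep of this kind might look roughly like the following sketch, written against the standard java.nio.file API. It is not the project's implementation. The list of image extensions is an assumption, the class and field names are hypothetical, and the check of XHTML files against the CMS is omitted (files with the "html" extension are only collected).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Pattern;
import java.util.stream.Stream;

// Hypothetical sweep over the "converted" tree; names are illustrative only.
final class CorpusSweepSketch {
    // Assumed set of "common image extensions".
    static final Set<String> IMAGE_EXTENSIONS =
            new HashSet<>(Arrays.asList("jpg", "jpeg", "png", "gif"));
    // Stem ends in <itemID>_<pageseq>, with pageseq assumed numeric.
    static final Pattern PAGE = Pattern.compile("(?:.*_)?[^_]+_\\d+");
    // Final underscore-delimited segment starts with "fig".
    static final Pattern FIGURE = Pattern.compile("(?:.*_)?fig[^_]*");

    static final class Result {
        final List<Path> textFiles = new ArrayList<>();
        final List<Path> pageImages = new ArrayList<>();
        final List<String> errors = new ArrayList<>();
    }

    static Result sweep(Path root) throws IOException {
        Result result = new Result();
        try (Stream<Path> paths = Files.walk(root)) { // recursive traversal
            paths.filter(Files::isRegularFile).sorted().forEach(p -> classify(p, result));
        }
        return result;
    }

    static void classify(Path path, Result result) {
        String name = path.getFileName().toString();
        int dot = name.lastIndexOf('.');
        if (dot < 0) {
            return; // no extension: not part of the corpus
        }
        String stem = name.substring(0, dot);
        String ext = name.substring(dot + 1).toLowerCase(Locale.ROOT);
        if (ext.equals("html")) {
            result.textFiles.add(path);
        } else if (IMAGE_EXTENSIONS.contains(ext)) {
            if (FIGURE.matcher(stem).matches()) {
                return; // internal resource, ignored by the sweep
            }
            if (PAGE.matcher(stem).matches()) {
                result.pageImages.add(path);
            } else {
                result.errors.add("Image does not follow the naming convention: " + path);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args.length > 0 ? args[0] : "converted");
        if (!Files.isDirectory(root)) {
            System.out.println("No such directory: " + root);
            return;
        }
        Result r = sweep(root);
        System.out.println(r.textFiles.size() + " text files, " + r.pageImages.size()
                + " page images, " + r.errors.size() + " errors");
    }
}
```

The errors list plays the role of the transcript shown on the status page.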

Search Index

The search index for the bibliographic database and text corpus is held in a unified Lucene index, which has one entry for each item, as well as an entry for each page sequence which has textual content. These two types of Lucene Document are distinguished by the document field named "type", which holds the value "item" representing an entire item, and "page" representing a page.

A Document of type "item" contains fields with the values shown in the "Lucene Field" column of the table in the section "Bibliographic Database" above. A Document of type "page" also contains a copy of each of these fields for its parent item, together with a further field named "text" holding the tokenized contents of the corpus text for that page, and a field "flat-text" holding the raw untokenized text, which is used when rendering search hits. Documents also contain various other synthesized fields, whose names are held in the Java class uk.org.ponder.darwin.search.DocFields. These hold a record of the page sequence range stored for an item, as well as the filesystem date last seen for the corpus file corresponding to it.

Indexing procedure:

The Lucene indexing procedure is very much more expensive than the filesystem sweep, and as a result has only been provided with an off-line implementation, to ensure greater controllability of the potentially expensive process. As a rough guide, on the current deployment the filesystem sweep takes on the order of 60 seconds, whereas a full Lucene reindex takes on the order of 30 minutes.

Triggering a Lucene reindex involves invoking the Java class test.TestIndex, which is held in the main Darwin package (see below for details of code structure). The single argument to this class is a Spring-formatted XML context file which holds the configuration for the index system. Sample files are held in the directory "conf" in the same project, with the file buildContext-beast.xml corresponding to the configuration used on the production server. For example, this configures the location of the generated Lucene index with the following declaration:

<bean id="indexDirectory" class="java.lang.String">
  <constructor-arg>
    <value>/home/darwin/temp-lucene-index</value>
  </constructor-arg>
</bean>

The index files, once generated, are to be copied to the final path /home/darwin/lucene-index by hand, and the servlet is then stopped and restarted. This procedure guards against any inconsistent modifications or access to the index database by the running system – whilst simultaneous updates and access are advertised as possible by Lucene, this has been found troublesome in practice.

There are other versions of buildContext.xml in the conf directory suitable for other environments – for example buildContext-home.xml is suitable for local testing on a developer machine.

The indexer requires access to the complete set of JAR files deployed by the application. These are mostly already present in the "lib" subdirectory of the webapp, with one exception: the JAR servletapi-1.3.jar, which is hosted internally within the Tomcat server. The most convenient way to invoke the indexer is therefore to create a fresh directory named "indexlib", containing the JARs from "lib" plus the missing servletapi-1.3.jar, and to issue a command like the following:

java -Djava.ext.dirs=indexlib test.TestIndex buildContext-beast.xml

The User Interface

The user interface for Darwin Online is defined by a set of pure-HTML templates rendered by the RSF presentation framework. Some technical details of the interface are presented here.

Page Counters:

The page counters for the system are implemented by a simple AJAX (or more specifically AHAH http://en.wikipedia.org/wiki/AHAH) view which fetches a short page-specific HTML fragment from the server to update a <div> tag on the current page. This system was chosen in preference to the more standard but old-fashioned system based around server-fabricated images on the grounds of increased efficiency and usability. The page counter system is currently the only use made of the backend database. This contains a single, thin but long table with one entry for each URL that has been detected by the system. A schematic for this table, called "pagecount" appears below:

Column	Type	Description
ID	Integer	Primary key
URL	String	Raw URL as detected locally in browser pane
URLHash	String	An SHA-1 hash of the URL; an index is applied to this column for rapid lookup.

The Java backend currently accesses this table by means of a simple Hibernate (http://www.hibernate.org/) mapping hidden behind a DAO interface.
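
As an illustration of how a value for the URLHash column could be computed, the sketch below uses the standard java.security.MessageDigest class. The source does not say how the hash is encoded or which character encoding is applied to the URL first. Lower-case hexadecimal and UTF-8 are assumptions, and the class name is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for deriving a fixed-length lookup key from a URL.
final class UrlHashSketch {
    /** Returns the SHA-1 digest of the URL as 40 lower-case hex characters. */
    static String sha1Hex(String url) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-1");
            // Assumption: the URL is encoded as UTF-8 before hashing.
            byte[] hash = digest.digest(url.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is among the algorithms every Java platform must provide.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(sha1Hex("http://darwin-online.org.uk/"));
    }
}
```

Indexing a fixed-length hash rather than the raw URL keeps the index key short and uniform however long the URL is.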

The AHAH strategy is suitable for those pages which are served as static content (currently by the Apache frontend). Those pages which are rendered by the Java backend instead have the page counter markup embedded directly in the markup on initial rendering, to avoid a further trip to the server.