Content Markup Standard for Darwin Online

v1.1 July 2007

Prepared by: Antranig Basman (amb26@ponder.org.uk)

General Aims:

As far as possible, existing working practice for Darwin Online is to be preserved. A considerable body of text encoded as XHTML has been accumulated, along with the skills needed to manage it. This encoding standard seeks to allow these texts to be efficiently indexed and assembled by automated processes, while leaving their structure as far as possible intact.
Maximum compliance with public and established encoding standards is a high priority. At the very least, the encoded documents must remain valid XML.
Although the content is presented to users through a process of markup rewriting, it is preferable that the static content of Darwin Online remain a valid web structure in its own right. Rewriting is essential to one of the search engine goals: line-precise link targets for search hits.

Constraints:

The ideal solution involves extending the XHTML DTD with an internal subset, published at the head of the document. Ironically, while this solution preserves compliance with the letter of the XML standard, none but the most modern browsers render such documents correctly, and then only under restricted combinations of file extension and MIME type that are mutually incompatible from one browser to another.

Result:

In practice, one policy on which all tested web browsers agree is the silent elision of unrecognised tag attributes, especially where these form part of a namespace alien to XHTML. An internal subset specifying all attributes may be prepared, allowing a service to be provided that serves fully XML-validating renderings of the content pages to those clients who request them. Such users, however, form no more than a very small portion of the target audience of the project, and this capability is left in reserve for future development.

Specific Aims:

The primary aim of the markup is to allow a precise correspondence between individual markup pages, and their parent record in the bibliographic database.
The secondary aim is to aid navigation by establishing a correspondence between markup pages and bitmapped renditions of their document originals.
A further aim is to aid the process of establishing correspondence between different editions of the same work, by providing at least sectional markup at the levels of chapters or other such divisions where they are present in the text. (This is also deferred subject to funding.)

General Policy:

All markup will be presented as attributes applied to otherwise valid XHTML tags. These attributes will all be qualified with the namespace prefix dar:, which will be interpreted as if specified by the namespace declaration xmlns:dar="http://www.darwin-online.org.uk/schema". For reasons of installed browser compatibility this declaration will not actually be placed in the raw files produced as a primary output of the project – the declaration, like the internal subset mentioned above, may be synthesised by a rendering process if required for particular clients.
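
As an illustration of the synthesis step described above, the following is a minimal sketch in Python of how a rendering process might add the namespace declaration before handing a page to a namespace-aware XML parser. The function name add_dar_namespace and the sample page are hypothetical and not part of the project's tooling. The sketch assumes the root element is written as a literal <html ...> start tag.

```python
import re
import xml.etree.ElementTree as ET

# Namespace URI taken from the policy text above.
DAR_NS = "http://www.darwin-online.org.uk/schema"
XHTML_NS = "http://www.w3.org/1999/xhtml"


def add_dar_namespace(raw_xhtml: str) -> str:
    """Insert xmlns:dar on the first <html> start tag if it is missing.

    Hypothetical helper: assumes the root element appears as '<html'.
    """
    if "xmlns:dar=" in raw_xhtml:
        return raw_xhtml
    return re.sub(r"<html\b", f'<html xmlns:dar="{DAR_NS}"', raw_xhtml, count=1)


# A raw file as stored by the project: dar: attributes, no declaration.
raw = (
    f'<html xmlns="{XHTML_NS}"><body>'
    '<p dar:class="page">[page] 266</p>'
    "</body></html>"
)

# A namespace-aware parser rejects the undeclared dar: prefix.
try:
    ET.fromstring(raw)
    raw_parses = True
except ET.ParseError:
    raw_parses = False

# After synthesis the same content parses, and the attribute is
# addressable by its fully qualified name.
root = ET.fromstring(add_dar_namespace(raw))
page = root.find(f".//{{{XHTML_NS}}}p")
page_class = page.get(f"{{{DAR_NS}}}class")
```

A browser that silently ignores unknown attributes never needs this step; it matters only for clients that request strictly namespace-conformant XML.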

Page correspondence:

Markup pages must correspond with bitmapped renditions, and multiple markup documents covering different page ranges must be reassembled to reconstruct the original physical page sequence of the document. Together with the requirement that readers can relate the markup to the original text, this necessitates maintaining two corresponding page numbers for each page: first, the physical number as printed on the page, and second, the sequential number giving the offset in pages from the start of the printed matter of the book or manuscript. Sequential page numbers are positive integers, with the first printed page in the book numbered 1.

Existing practice of the project denotes page numbers as follows: <p>[page] 266</p>.

To meet the correspondence requirements, this markup will be extended as follows:

<p dar:class="page">[page] 266</p>
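
As a rough sketch of how an indexing process might pair the two page numbers, the Python fragment below collects the page markers in document order and assigns each a sequential number. The function page_table, the regular expression, and the starting value 280 are illustrative assumptions only. The physical label is kept as a string, on the assumption that printed page numbers are not always plain integers.

```python
import re

# Matches the extended page marker, capturing whatever follows "[page]".
# Illustrative only: a production process would use a real XML parser.
PAGE_MARKER = re.compile(
    r'<p\s+dar:class="page"\s*>\s*\[page\]\s*(.*?)\s*</p>', re.DOTALL
)


def page_table(markup: str, seq_start: int = 1):
    """Return (sequential, physical) pairs for each page marker, in order."""
    labels = PAGE_MARKER.findall(markup)
    return [(seq_start + i, label) for i, label in enumerate(labels)]


sample = (
    '<p dar:class="page">[page] 266</p><p>text</p>'
    '<p dar:class="page">[page] 267</p><p>more text</p>'
)

# 280 is a made-up sequential number for the first page of this fragment.
table = page_table(sample, seq_start=280)
```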

Correspondence with bibliographical database:

Whole documents, or sections of them, will be marked up with the attribute dar:class="document" applied to an enclosing tag. To minimise burdens on the internal subset, we specify provisionally that this tag should be of type div. The bibliographic ID will be specified with an additional attribute dar:id="<bib-id>", where <bib-id> is the ID as entered in the field named ID in the table Item of the bibliographic database. Finally, the complete sequential page range contained in the tag, which must be a single contiguous span, must be specified with an attribute dar:seqpagerange="[seqfrom]-[seqto]", where [seqfrom] is the sequential start page (inclusive) and [seqto] the sequential end page (exclusive) of the pages present in the document.

For example, the body of a project document may be marked up as follows:

<div dar:class="document" dar:id="F301" dar:seqpagerange="35-176">

indicating that the document includes sequential pages 35 up to, but not including, 176 of the item with Freeman catalogue number 301.

The upper bound of the page range is not essential for correct operation.
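
The half-open convention can be made concrete with a short Python sketch. The helpers parse_seqpagerange and find_gaps are hypothetical names. The sketch assumes both bounds are present in the attribute value, and it checks only whether the ranges of several documents join up when sorted by start page.

```python
def parse_seqpagerange(value: str) -> range:
    """Parse "[seqfrom]-[seqto]" into a half-open range of sequential pages."""
    start_text, _, end_text = value.partition("-")
    start, end = int(start_text), int(end_text)
    if start < 1 or end < start:
        raise ValueError(f"invalid seqpagerange: {value!r}")
    return range(start, end)  # start inclusive, end exclusive


def find_gaps(ranges):
    """Return (expected, found) pairs wherever consecutive ranges do not meet."""
    ordered = sorted(ranges, key=lambda r: r.start)
    return [
        (prev.stop, nxt.start)
        for prev, nxt in zip(ordered, ordered[1:])
        if prev.stop != nxt.start
    ]


pages = parse_seqpagerange("35-176")
first_page, last_page = pages[0], pages[-1]
```

Because the end is exclusive, two documents that follow one another share a boundary value: a document ending at 35 is immediately followed by one starting at 35.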

Sectional markup [Optional capability]:

A tag whose start corresponds to the beginning of a recognisable division in the original source may be decorated with the attributes dar:class="section" dar:sectiontype="[sectiontype]" dar:sectionname="[sectionname]". The value of [sectiontype] will be chosen from a restricted set, containing at least the values "chapter" and "section", with others possibly added as more content is encountered. These attributes will be applied to a tag of type div. For example, a section heading may be marked up as follows:

<div dar:class="section" dar:sectiontype="chapter" dar:sectionname="[sectionname]">

<h1>Chapter 3 - The Struggle for Existence</h1>

</div>

No particular conditions are placed on the location of the closing tag whose opener is marked up as a section start.

In practice no sectional markup has been provided for any of the Darwin Online content, the existing navigation by page sequence and label having proved generally adequate.

Other standards:

Other than the markup mentioned above, content will be prepared in accordance with the XHTML 1.0 (Transitional) DTD. The UTF-8 encoding will be used for all content.

Hyphens that indicate words broken at line ends on the physical page will be rendered with the Unicode code point U+002D (HYPHEN-MINUS), the standard ASCII hyphen character. By contrast, hyphens printed for reasons other than line-breaking will be rendered as U+2010 (HYPHEN), the dedicated Unicode hyphen character. This choice has been made on the expectation that the majority of hyphens in documents will be of the breaking kind.
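
To show how this distinction might be exploited, the following Python sketch rejoins broken words before indexing. It is an assumption of this sketch, not a statement of the project's method, that every U+002D in the text marks a broken word and that any whitespace following it (such as a line break) should be discarded. The name normalise_for_index is hypothetical.

```python
import re

# A breaking hyphen (U+002D) plus any whitespace after it, e.g. a newline.
BREAKING_HYPHEN = re.compile(r"\u002D\s*")


def normalise_for_index(text: str) -> str:
    """Rejoin words split by a breaking hyphen; leave U+2010 hyphens alone."""
    return BREAKING_HYPHEN.sub("", text)


# "strug-" / "gle" is a line-break split; "well\u2010marked" is a true hyphen.
sample = "the strug-\ngle for existence among well\u2010marked varieties"
indexed = normalise_for_index(sample)
```

Text that uses U+002D for other purposes, such as dashes or numeric ranges, would need separate handling.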

Whilst hyphens within pages have been normalised in accordance with this recommendation, normalisation of hyphens lying at physical page boundaries has not yet been implemented. This will prove problematic for the current search index structure, which breaks the document up at page boundaries so that the page on which a search hit occurs can be located rapidly.