Date Encoding Standard for Darwin Online
v1.2 January 17th 2006
Prepared by: Antranig Basman (amb26@ponder.org.uk)
Context:
A standard for encoding dates, suitable for bibliographic records, has been inherited from the catalogue of Darwin manuscripts made by Nick Gill for the Cambridge University Library. It is proposed to adopt a simplified protocol based on this standard for use with Darwin Online.
The documentation for the Gill format is reproduced as an appendix below.
Policy:
Dates encoded in the Gill form should be parseable, although largely through tolerance and elision of extraneous characters. Dates for use in Darwin Online will comprise:
EITHER
A single date in the format yyyy.mm.dd
OR
A single, contiguous date range in the format yyyy.mm.dd--yyyy.mm.dd
"Partly undated" items will be represented with zero digits (unlike the Gill format, which also permitted 9s).
"circa" dates will NOT be represented explicitly in the database – if importing the Gill format, the text "ca" appearing anywhere will be used to set the "imprecise" field below.
An "editorial" date will be detected through the presence of either of the characters [ or ] in the textual form.
An "uncertain" date will be detected through a question mark character appearing in the date field.
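The policy above can be illustrated with a minimal Python sketch. The function name `parse_darwin_date` and the tuple representation of dates are assumptions made for illustration only; they are not part of the standard. Dates are held as (year, month, day) integer tuples so that zero digits can stand for unknown parts.

```python
import re

# A single yyyy.mm.dd triple; zero digits stand for unknown parts.
_TRIPLE = re.compile(r"(\d{4})\.(\d{2})\.(\d{2})")


def parse_darwin_date(text):
    """Parse a Darwin Standard date string (hypothetical helper).

    Returns (start, end, editorial, imprecise, uncertain), where start and
    end are (year, month, day) tuples of ints and 0 means "unknown".
    """
    editorial = "[" in text or "]" in text   # either bracket character
    imprecise = "ca" in text                 # "ca" appearing anywhere
    uncertain = "?" in text                  # question mark anywhere

    # Elide the marker characters so that only digits and dots remain
    # around each date triple.
    cleaned = re.sub(r"[\[\]?]", "", text)
    triples = [tuple(int(g) for g in m.groups())
               for m in _TRIPLE.finditer(cleaned)]
    if len(triples) not in (1, 2):
        raise ValueError(f"expected one date or one range: {text!r}")

    # A single date yields a range whose end equals its start.
    start, end = triples[0], triples[-1]
    return start, end, editorial, imprecise, uncertain
```

Under these assumptions, `parse_darwin_date("[1850?].06.[25]")` yields the single date 1850.06.25 with the editorial and uncertain flags set. Matching "ca" anywhere in the text follows the policy literally and would also fire on any other word containing those letters.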
Storage format:
Within the database, dates will be represented as a collection of five "core fields", and two textual fields:
Name | Type | Explanation |
StartDate | Date | The start of the date range represented. |
EndDate | Date | The end of the date range represented. If a single date, will be equal to the start date. |
Editorial | Boolean | Representing an editorial date, specified by any of the ][ characters |
Imprecise | Boolean | An imprecise date, specified by the string "ca" |
Uncertain | Boolean | An uncertain date, specified by a question mark ? |
DisplayDate | Text | A display format for the date, suitable for an end reader. |
OriginDate | Text | The "origin" format for the date as imported from external source (Gill, APS, or Darwin Standard). Will be parsed to construct five core fields. |
Data entry will set values for the final two fields, with the "origin format" being used automatically by the system to infer values for the first five, which will be the form manipulated by the search engine.
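As a rough illustration of the seven-field layout, the following Python sketch mirrors the fields in a dataclass. The class name `DateRecord`, the method `core_fields`, and the choice to hold the two Date fields as "yyyy.mm.dd" text are assumptions for this example only; the actual database types are not specified here. The sample values are taken from the first row of the examples table.

```python
from dataclasses import dataclass


@dataclass
class DateRecord:
    """Hypothetical in-memory mirror of the seven storage fields."""
    start_date: str    # StartDate, held here as "yyyy.mm.dd" text
    end_date: str      # EndDate; equal to start_date for a single date
    editorial: bool    # Editorial
    imprecise: bool    # Imprecise
    uncertain: bool    # Uncertain
    display_date: str  # DisplayDate, set at data entry
    origin_date: str   # OriginDate, set at data entry

    def core_fields(self):
        """Return the five core fields that a search would operate on."""
        return (self.start_date, self.end_date,
                self.editorial, self.imprecise, self.uncertain)


record = DateRecord(
    start_date="1839.01.09",
    end_date="1839.01.09",
    editorial=True,
    imprecise=True,
    uncertain=False,
    display_date="ca.[9 January 1839]",
    origin_date="[1839.01.09] & [1839.01.05--1839.01.13] "
                "<display/> [1839.01.09.ca]",
)
```

Text is used for the two date fields in this sketch because zero digits such as 0000.00.00 cannot be held in Python's `datetime.date`.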
Examples:
The following table shows the "normalized" forms of various imported dates. As an intermediate step in conversion, imported dates will be treated "as if" they had been specified using the textual form in the centre column, although this form will not necessarily be stored in the database.
Origin | Darwin Standard | Display Format |
[1839.01.09] & [1839.01.05--1839.01.13] <display/> [1839.01.09.ca] | [1839.01.09.ca] | ca.[9 January 1839] |
[1847.01.20] & [[1846.00.00]]--[1847.01.20] <display/> [1847.01.20 or before] | [1847.01.20] | [20 January 1847] |
[9999.99.99] & [[1840.00.00--1882.04.00]] <display/> [nd] | [0000.00.00] | [Undated] |
[9999].03.22 & [[1848]].03.22-|-[[1882]].03.22 <display/> [ny].03.22 | [0000.00.00] | [Undated] |
[1880.00.00] & [1876.00.00--1884.00.00] <display/> [1880.ca] | [1880.00.00.ca] | [ca. 1880] |
[1878] Dec. 11th | [1878.12.11] | [11 December 1878] |
[1867] Aug 24 [end. 24th. August 1867] | [1867].08.24 | 24 August [1867] |
Display Formats:
The Gill format has established a general convention, examples of which follow:
[1871--1874?].09.10
[1850.05.25.after]
[nd]
[1858.03.11.probably]
[ny].09.08--[ny].09.11
1868.03.[06.before]
Some entries in the APS catalogue use a different convention, which appears slightly more conformant with typical historical renderings – however, it should be noted that dates in the Gill format are considerably more numerous:
1838 Decr. 20th [end. March 1, 1839, wmk. 1836]
[1861] July 18th [wmk. 1860]
[?1850 January 24] Thursday Evening
Darwin Online will adopt the following standard layout for display dates, examples of which appear in the third column of the preceding table:
{2 digit day} {long month form} {4 digit year}
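A minimal Python sketch of this layout follows. The function name `format_display` is hypothetical. The treatment of zero (unknown) parts and the wrapping of the whole result in square brackets for editorial dates are assumptions inferred from the third column of the table, not rules stated in this standard. The day is printed without zero padding, as in the table entry "9 January 1839".

```python
MONTHS = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]


def format_display(year, month, day, editorial=False):
    """Render a date as '{day} {long month} {year}' (hypothetical helper).

    A part given as 0 is treated as unknown (an assumption): an unknown
    year gives 'Undated', an unknown month gives the year alone, and an
    unknown day gives month and year.
    """
    if year == 0:
        text = "Undated"
    elif month == 0:
        text = str(year)
    elif day == 0:
        text = f"{MONTHS[month - 1]} {year}"
    else:
        text = f"{day} {MONTHS[month - 1]} {year}"
    # Assumption: an editorial date is bracketed as a whole.
    return f"[{text}]" if editorial else text
```

For example, `format_display(1847, 1, 20, editorial=True)` gives `[20 January 1847]`. Partially bracketed forms such as `24 August [1867]` and the placement of "ca." are not covered by this sketch.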
Unresolved Points:
i) Significance behind differing formats for uncertain dates (9999 characters vs. 0000 – query to Nick Gill).
Appendix: Historical format (Nick Gill):
All <date>s have a fully digital canonical or `proper' form, using the
basic style
yyyy.mm.dd
eg
1850.06.25
Wholly undated items are digitised as 9999.99.99, partly undated as
9999.mm.dd, yyyy.00.dd, yyyy.mm.00, yyyy.00.00 etc as appropriate.
In many cases a date contains editorial-intervention markers, viz.
square brackets and/or question-marks, in principle almost anywhere in
the string, eg
[1850?].06.[25]
18[50?].06.[2]5
[18]50[.06.?]2[5]
etc.
Some dates can only be expressed in the form of ranges. There are two
types of range: continuous, symbolised with the marker --, eg
1850.06.12--1852.08.25
which comprises every day between the two termini; and discontinuous,
symbolised with the marker -|-, eg
1850.06.12-|-1852.08.25
which comprises the 12--25 of each of the months .06.--.08. in each of
the years 1850--1852, ie a set of 9 discrete `globules' of
mini-ranges.
Thus there are only 3 basic `proper-date' types,
yyyy.mm.dd
yyyy.mm.dd--yyyy.mm.dd
yyyy.mm.dd-|-yyyy.mm.dd
Any number of components conforming to one or other of these three
types can be linked by `&' to form the actual <date> of an item.
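[Editorial illustration, not part of the Gill documentation.] The discontinuous range described above can be sketched in Python as follows. The function name `expand_discontinuous` is hypothetical, and the sketch assumes a plain digit string with no brackets or question marks, no zero or 9999 placeholders, and no wrap-around of months across a year boundary.

```python
from datetime import date


def _triple(text):
    """Split 'yyyy.mm.dd' into three ints."""
    year, month, day = (int(part) for part in text.split("."))
    return year, month, day


def expand_discontinuous(text):
    """Expand 'yyyy.mm.dd-|-yyyy.mm.dd' into its discrete mini-ranges.

    Returns a list of (first_day, last_day) pairs, one per month in the
    month span for each year in the year span.
    """
    first, last = text.split("-|-")
    y1, m1, d1 = _triple(first)
    y2, m2, d2 = _triple(last)
    if y1 > y2 or m1 > m2 or d1 > d2 or 0 in (m1, d1):
        raise ValueError(f"unsupported discontinuous range: {text!r}")
    return [(date(y, m, d1), date(y, m, d2))
            for y in range(y1, y2 + 1)
            for m in range(m1, m2 + 1)]
```

Applied to `1850.06.12-|-1852.08.25`, this produces the 9 mini-ranges mentioned above, from 12--25 June 1850 through 12--25 August 1852.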
Any <date> may if necessary have a <display/> form differing from the
canonical `proper' form, eg
<date> [9999.99.99] <display/> [nd] </date>
<date> [1871?].09.10-|-[1874?].09.10 <display/> [1871--1874?].09.10 </date>
etc. Any date containing a verbal component or qualifier necessarily
requires such treatment, eg
<date> [1850.05.26]--[9999.99.99] <display/> [1850.05.25.after] </date>
The first (or, in `unitary' cases, the only) component of the date is the
sortpoint. Dates involving `before' and `circa' are thus treated as a
combination of "sortpoint & range", eg
<date> [1850][.05.24] & [1850][.01.01]--[1850][.05.24]
<display/> [1850][.05.25.before] </date>
to ensure that ranges are always expressed as chronologically
forward-running; further examples (presuming a default of +/-4 for
`circa'),
<date> [1850.05.25] & [1850.05.21]--[1850.05.29]
<display/> [1850.05.25.ca] </date>
<date> [1971.01.00] & [[1885.00.00]]--[1971.01.00]
<display/> [1971.02.00.before] </date>
<date> [9999].02.19 & [[1851]].02.19-|-[[1928]].02.19
<display/> [ny].02.19 </date>
<date> [9999].03.22 & [[1845]].03.22-|-[[1912]].03.31
& [[1845]].04.01-|-[[1912]].04.25
<display/> [ny].easter </date>
<date> [1890.11.00]-|-[1899.03.00] <display/> [1890s.winter] </date>
(the double-brackets in some of these being instances where open-ended ranges have been arbitrarily delimited by secondary information such as outer-dates derived from <lifespan>s, etc. These secondary editorial interventions are always distinguished by the use of double-brackets, which are not used anywhere else.)
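[Editorial illustration, not part of the Gill documentation.] The "sortpoint & range" treatment of a `circa' date, with the default of +/-4 days presumed above, can be sketched in Python as follows. The function name `circa_components` and its parameter `tolerance_days` are hypothetical, and the sketch assumes a fully specified date that is wholly editorial (bracketed as a unit).

```python
from datetime import date, timedelta


def _fmt(d):
    """Format a date in the canonical yyyy.mm.dd style."""
    return f"{d.year:04d}.{d.month:02d}.{d.day:02d}"


def circa_components(sortpoint, tolerance_days=4):
    """Build the 'sortpoint & range' proper form for a circa date."""
    spread = timedelta(days=tolerance_days)
    return (f"[{_fmt(sortpoint)}] & "
            f"[{_fmt(sortpoint - spread)}]--[{_fmt(sortpoint + spread)}]")
```

With the default tolerance, `circa_components(date(1850, 5, 25))` reproduces the proper form of the `[1850.05.25.ca]` example: `[1850.05.25] & [1850.05.21]--[1850.05.29]`.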
The complex dates in the current fileset have all been analysed into an appropriate set of discontinuous ranges, as in the last 3 examples above.
Beyond this superficial laying out of various typical kinds of output, the process by which the analysis of complex dates is conducted (ie the manner in which the <display/> form is edited into the `proper' form) is impossible to describe briefly, and is hideously non-trivial.