Archaeotools, data mining, facetted classification

Project: Research project (funded) › Research

Project Details

Description

The Archaeotools project built upon previous work to develop tools to allow archaeologists to share and analyse datasets and publications within a collaborative environment. It had three interrelated objectives, each represented by a distinct workpackage.

The first aim was to index the Archaeology Data Service (ADS) structured database of over one million metadata records describing sites and monuments in the British Isles according to three criteria: When, What and Where. The project then used techniques of facetted classification, derived from information science, to allow users to navigate the resulting three-dimensional space, explore the data sets held within it, and make links between specific records. A map-based interface was developed to allow the spatial dimension to be explored.
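The idea of navigating by When, What and Where can be illustrated with a minimal sketch. It is not the project's implementation: the records, facet values and function names below are hypothetical, and a production index would sit on top of controlled thesauri rather than plain strings.

```python
from collections import defaultdict

# Hypothetical records; ids and facet values are invented for illustration.
RECORDS = [
    {"id": 1, "when": "Roman", "what": "villa", "where": "Yorkshire"},
    {"id": 2, "when": "Medieval", "what": "church", "where": "Yorkshire"},
    {"id": 3, "when": "Roman", "what": "fort", "where": "Cumbria"},
]

FACETS = ("when", "what", "where")


def build_index(records):
    """Map each facet value to the set of record ids that carry it."""
    index = {facet: defaultdict(set) for facet in FACETS}
    for rec in records:
        for facet in FACETS:
            value = rec.get(facet)
            if value:  # records with a missing facet are simply not indexed under it
                index[facet][value].add(rec["id"])
    return index


def select(index, all_ids, **chosen):
    """Intersect the id sets of every chosen facet value."""
    result = set(all_ids)
    for facet, value in chosen.items():
        result &= index[facet].get(value, set())
    return result


def facet_counts(index, facet, within):
    """Count how many of the currently selected records fall under each value."""
    return {
        value: len(ids & within)
        for value, ids in index[facet].items()
        if ids & within
    }


index = build_index(RECORDS)
all_ids = {rec["id"] for rec in RECORDS}
roman = select(index, all_ids, when="Roman")
roman_in_yorkshire = select(index, all_ids, when="Roman", where="Yorkshire")
where_counts = facet_counts(index, "where", roman)
```

Each selection narrows the record set along one dimension, and the counts for the remaining facets show the user where the surviving records lie.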

Secondly, the project employed natural language processing to allow automated tools to search within documents for terms belonging to known classification schemes and add them to the facetted index, providing much deeper and richer access to unpublished archaeological literature. Although this literature forms the primary record of most archaeological investigation within the UK, scholarly and public access to it has hitherto been severely curtailed, imposing a major constraint on archaeological research.

Thirdly, these tools were employed to investigate whether it is also possible to identify and harvest index terms within older antiquarian literature, as represented by back runs of archaeological journals currently being digitised and made available online. As site reports in this older literature rarely give precise geospatial coordinates, it was necessary to use natural language processing to recognise and harvest place names. The place names were supplied to GeoXwalk, an online gazetteer of names, in order to return precise grid coordinates which could be added to the index.

At the end of the project we have created a major sustainable resource for archaeological research and made it available to users via a completely revised interface to all ADS resources. We have also been able to make recommendations for the future format and indexing of grey literature, and to draw lessons for the wider humanities e-Science community.

Key findings

Workpackage 1: We were able to map the 1 million records to thesauri by a combination of automatic rule-based expressions and manual techniques. The 'When' facet provides an example of the success of this combined approach. There are many ways in which archaeological dates and date ranges can be written, e.g. 1066, 1001-1100, 11th centuary (sic), C11, 11C, eleventh century. Most of these were mapped directly to MIDAS-defined date ranges. Analysis initially recovered 457 instances of irresolvable dates, equating to 114,505 records which could not be classified. After automated processing this was reduced to 148 concepts and only 7,528 records, a manageable number to correct by manual intervention. The variety of uncontrolled terminology used for the 'What' facet, combined with a significant number of records with no subject information, proved more intractable, but this was not a serious problem as most records still appeared under either the 'When' or 'Where' facet. In total, of 1,001,595 records submitted for classification, 995,907 appeared in at least one facet, leaving only 5,688 records totally unclassified.
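A rule-based approach to the date variants listed above might look like the following sketch. It is illustrative only: the patterns, the function names and the convention that the nth century runs from year (n-1)*100+1 to n*100 are assumptions made for this example, not the project's actual rules or the MIDAS definitions.

```python
import re

# Spelled-out ordinals up to "twentieth" (an arbitrary cut-off for this sketch).
ORDINALS = (
    "first second third fourth fifth sixth seventh eighth ninth tenth "
    "eleventh twelfth thirteenth fourteenth fifteenth sixteenth "
    "seventeenth eighteenth nineteenth twentieth"
).split()
ORDINAL_WORDS = {word: n for n, word in enumerate(ORDINALS, start=1)}


def century_span(n):
    """Assumed convention: the 11th century is 1001-1100."""
    return ((n - 1) * 100 + 1, n * 100)


def word_century(match):
    """Resolve 'eleventh century'; return None for unknown words."""
    n = ORDINAL_WORDS.get(match[1])
    return century_span(n) if n else None


# Ordered list of (pattern, converter); the first rule that yields a span wins.
RULES = [
    (re.compile(r"^(\d{1,4})\s*-\s*(\d{1,4})$"), lambda m: (int(m[1]), int(m[2]))),
    (re.compile(r"^c(\d{1,2})$"), lambda m: century_span(int(m[1]))),
    (re.compile(r"^(\d{1,2})c$"), lambda m: century_span(int(m[1]))),
    # 'cent\w*' deliberately tolerates misspellings such as 'centuary'.
    (re.compile(r"^(\d{1,2})(?:st|nd|rd|th)\s+cent\w*$"), lambda m: century_span(int(m[1]))),
    (re.compile(r"^([a-z]+)\s+cent\w*$"), word_century),
    (re.compile(r"^(\d{3,4})$"), lambda m: (int(m[1]), int(m[1]))),
]


def normalise_date(text):
    """Return a (start_year, end_year) tuple, or None if no rule applies."""
    cleaned = text.strip().lower()
    for pattern, convert in RULES:
        match = pattern.match(cleaned)
        if match:
            span = convert(match)
            if span is not None:
                return span
    return None  # left for manual intervention
```

Strings that fall through every rule return `None`, which corresponds to the residue that would be passed on for manual correction.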



Workpackage 2: Relatively high levels of success were achieved when the same techniques were applied to the sample of 1,000 semi-structured grey literature reports. The greatest problem encountered was that of distinguishing between 'actual' and 'reference' terms. As well as the 'actual' place name referring to the location of the archaeological intervention, most grey literature reports also refer to comparative information from other sites, here called 'reference' terms. The information extraction (IE) software returned all place names in the document, masking the place name for the actual site amongst large numbers of other names. However, this was solved by adopting the simple rule that the primary place name would appear within the 'summary' section of the report. If it was not possible to identify a summary then the first 10% of the document was used instead.
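The summary-or-first-10% rule can be sketched as follows. This is a simplified illustration under stated assumptions: the report is assumed to be already split into titled sections, the 10% is measured in characters, and the function names and sample text are hypothetical rather than taken from the project's software.

```python
import re


def primary_window(sections, full_text):
    """Return the text in which the 'actual' place name is expected.

    `sections` maps section titles to section text (hypothetical structure).
    Falls back to the first 10% of the document, measured in characters.
    """
    for title, body in sections.items():
        if title.strip().lower() == "summary":
            return body
    cutoff = max(1, len(full_text) // 10)
    return full_text[:cutoff]


def classify_place_names(names, window):
    """Split extracted place names into 'actual' and 'reference' lists."""
    actual, reference = [], []
    for name in names:
        # Whole-word, case-insensitive match so 'York' does not hit 'Yorkshire'.
        pattern = r"\b" + re.escape(name) + r"\b"
        if re.search(pattern, window, flags=re.IGNORECASE):
            actual.append(name)
        else:
            reference.append(name)
    return actual, reference


# Case 1: a report with an identifiable summary section.
sections = {
    "Summary": "An evaluation at Wharram Percy recorded a medieval ditch.",
    "Discussion": "Similar features are known from York and Lincoln.",
}
full_text = " ".join(sections.values())
names = ["Wharram Percy", "York", "Lincoln"]
with_summary = classify_place_names(names, primary_window(sections, full_text))

# Case 2: no summary, so the first 10% of the text is used instead.
plain_text = "Excavation at Malton. " + "x " * 200 + "Compare York."
without_summary = classify_place_names(
    ["Malton", "York"], primary_window({}, plain_text)
)
```

Names found in the chosen window are treated as the site's own location; everything else is kept as comparative 'reference' material rather than discarded.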



Workpackage 3: The third strand of the project focused IE on the almost entirely unstructured digitised version of the Proceedings of the Society of Antiquaries of Scotland (PSAS). Despite the highly unstructured nature of the text and the antiquated use of language, we were surprised to find that, once trained on the grey literature reports, the IE software achieved comparable levels of success with the antiquarian literature. Problems were encountered with more synthetic papers and other types of document, but where the primary subject of the article was a fieldwork report it was possible to identify the key 'What', 'When' and 'Where' index terms using the same approach as adopted with the grey literature. After discounting prefatory papers, the PSAS corpus was reduced to 3,991 papers referring to archaeological discoveries. By applying the rule that the actual What, Where and When would appear in the first 10% of the paper, it was possible to identify a subject for all but 277 of the papers, although there was less success with a geospatial location (627 papers with no location), and least success with period terms (2,056 papers with no When term).
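The figures quoted for the PSAS corpus can be restated as coverage per facet. The short calculation below uses only the numbers given above; the variable names are illustrative.

```python
TOTAL_PAPERS = 3991

# Papers for which no term could be identified, per facet (from the text above).
MISSING = {"What": 277, "Where": 627, "When": 2056}

# Papers with an identified term, and the same as a percentage of the corpus.
identified = {facet: TOTAL_PAPERS - n for facet, n in MISSING.items()}
coverage = {
    facet: round(100 * count / TOTAL_PAPERS, 1)
    for facet, count in identified.items()
}

for facet in ("What", "Where", "When"):
    print(f"{facet}: {identified[facet]} of {TOTAL_PAPERS} papers ({coverage[facet]}%)")
```

On these figures a subject was found for roughly 93% of papers, a location for roughly 84%, and a period for just under half.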

Status: Finished
Effective start/end date: 1/09/07 to 10/02/10

Funding

  • AHRC: £321,817.00