Friday, 7 September 2012

Lessons learned

The Step change project has identified a number of useful 'lessons learned' - more will follow in future posts.

1. Data quality

The creation of RDF and linking with similar resources might expose legacy catalogue data as uneven, inadequate or inaccurate. It is likely that many existing catalogues, though adequate for basic online searching, are not up to the task in a Linked Data environment. Date ranges cited in archive catalogues are too broad to identify components of collections; geographical designations are insufficiently specific or too fuzzy (does 'London' mean Charing Cross or Croydon, the City or London, Canada? Which units are being described, and are they historically accurate?). The reality is that many catalogues predate not only RDF but the internet, and arguably are not fit for purpose in a Google-enabled search environment, being either inaccessible to search engines or not optimised for web-crawling.

Next steps:

Review of links: while an archivist or librarian might be familiar with their own collections, they are likely to be unfamiliar with each other's content, or content from unrelated sources (such as maps, audio-visual material or database content). A real example encountered in Step change was the join-up between archive collection descriptions and bibliographic information using the BNB. Archivists accessing the live service in CALM were often unable to identify, and therefore select for linking, the correct edition of an author's publication to match the relevant archive description by, or about, that author, because the service returned ambiguous or difficult-to-interpret bibliographic data. Confronted with practical problems such as these, the professional focus group, which convened to review the markup tool embedded in CALM, recommended adding an editing stage to CALM to preview possible selections of Linked Data join-ups, in order to minimise potential mistakes and make mark-up more efficient by reducing the need for time-consuming corrections after the fact.

Knowledge transfer: Furthermore, the linking preview problem clearly exposes the cross-disciplinary knowledge gap that hinders join-up between collections, except at the level of broad categories mapped across domains. Librarians, archivists, museum curators, academic experts and GIS and data curators simply don't know enough about each other's data to make the truly informed decisions that underpin the identification and building of entity relationships at the heart of any successful implementation of Linked Data methodologies.

Outcome and next steps: Axiell is considering incorporating an improved editing tool in future releases of CALM. For the mapping component of the project for AIM25, a preview tool has been developed and installed in the Alicat cataloguing utility. It uses the Google Maps API and Geonames to preview the names of places in micromaps, allowing the archivist to make speedier, more accurate choices of placenames before hitting the 'save' button.
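The preview idea can be illustrated with a minimal sketch. It is not the Alicat implementation: the record shape, the function names and the sample gazetteer entries (with placeholder identifiers and approximate coordinates) are all assumptions made for illustration, and a hard-coded list stands in for a live Geonames lookup. The point is simply that an ambiguous name such as 'London' returns several candidates, which are shown to the cataloguer before anything is saved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One possible match for a place name (hypothetical record shape)."""
    ident: str   # placeholder identifier, not a real gazetteer ID
    name: str
    region: str
    country: str
    lat: float   # approximate latitude, for illustration only
    lon: float   # approximate longitude, for illustration only


# Illustrative sample data standing in for a gazetteer response.
SAMPLE_GAZETTEER = [
    Candidate("example-1", "London", "Greater London", "United Kingdom", 51.51, -0.13),
    Candidate("example-2", "London", "Ontario", "Canada", 42.98, -81.25),
    Candidate("example-3", "Croydon", "Greater London", "United Kingdom", 51.37, -0.10),
]


def find_candidates(place_name, gazetteer):
    """Return every record whose name matches, ignoring case and outer whitespace."""
    wanted = place_name.strip().lower()
    return [c for c in gazetteer if c.name.lower() == wanted]


def preview_lines(candidates):
    """Format numbered preview lines for the cataloguer to choose from."""
    return [
        f"{i}. {c.name}, {c.region}, {c.country} ({c.lat:.2f}, {c.lon:.2f})"
        for i, c in enumerate(candidates, start=1)
    ]


def needs_review(candidates):
    """Only a single unambiguous match can be linked without a human choice."""
    return len(candidates) != 1


if __name__ == "__main__":
    matches = find_candidates("London", SAMPLE_GAZETTEER)
    for line in preview_lines(matches):
        print(line)
    print("Needs review:", needs_review(matches))
```

In a real tool the candidate list would come from the gazetteer service and each entry would be plotted on a small map, but the decision step is the same: no link is written until the archivist has picked one candidate.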

Step change's publication of UKAT as a Linked Data service helps overcome the knowledge gap, as it at least provides an agreed subject, place, person and corporate name listing as a common starting point in describing certain entities. What it doesn't do is capture relationships, and more work needs to be done to describe subject and domain-specific triples. A publicly-supported triplestore would be an important infrastructure development that would give professionals confidence that Linked Data is here to stay, and would encourage the investment needed to embed it in conventional cataloguing. Further steps are necessary, though, not least sponsorship of co-working between different knowledge professionals using cross-domain data - to properly document the challenges of mixing and matching library, archive and museum metadata and linking it with, say, research outputs in the arts and humanities.
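The difference between an agreed listing and captured relationships can be shown with a small sketch. Everything here is an assumption for illustration: the example.org URIs are placeholders, the collection is invented, and the use of the SKOS and Dublin Core property names is a common convention rather than a description of how the UKAT service is actually modelled. Triples are held as plain Python tuples instead of in a triplestore.

```python
# Each triple is (subject, predicate, object). All URIs are made-up placeholders.
EX = "http://example.org/"

# A term listing: each entity has an agreed label, but nothing connects entities.
LISTING_TRIPLES = [
    (EX + "term/maps", "skos:prefLabel", "Maps"),
    (EX + "term/charing-cross", "skos:prefLabel", "Charing Cross"),
]

# Relationship triples: statements that link a collection to listed terms.
RELATIONSHIP_TRIPLES = [
    (EX + "collection/example", "dcterms:subject", EX + "term/maps"),
    (EX + "collection/example", "dcterms:spatial", EX + "term/charing-cross"),
]


def label_for(uri, triples):
    """Return the preferred label recorded for a URI, or None if there is none."""
    for s, p, o in triples:
        if s == uri and p == "skos:prefLabel":
            return o
    return None


def describe(subject, triples):
    """List the relationships stated about a subject, showing labels where known."""
    statements = []
    for s, p, o in triples:
        if s != subject or p == "skos:prefLabel":
            continue
        target = label_for(o, triples) or o
        statements.append(f"{p} -> {target}")
    return statements


if __name__ == "__main__":
    collection = EX + "collection/example"
    print(describe(collection, LISTING_TRIPLES))  # the listing alone says nothing
    print(describe(collection, LISTING_TRIPLES + RELATIONSHIP_TRIPLES))
```

With only the listing loaded, nothing can be said about the collection; the relationship triples are the extra descriptive work that the paragraph above identifies as still to be done.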

The problem of inadequate catalogues is difficult to resolve - cataloguing backlogs are a higher priority than retroconversion and, should a catalogue be useful to potential researchers, it is usually deemed adequate. Training should be provided to potential cataloguers so that they better understand the implications of online search strategies and search engine optimisation (aside from Linked Data), which are probably poorly understood by most archivists. The use of certain agreed vocabularies should be encouraged where these exist as Linked Data, and the AIM25-UKAT service helps supply this need with an indexing tool that creates RDF as a by-product, without archivists necessarily being aware that this is happening. Some agreement should be reached on other specialist vocabularies, name authorities and place data (including historical places - at least in the UK) to create established hubs. These will potentially be more robust than a fragile cat's cradle of APIs prone to network disruption, and will serve as trustworthy and authentic points of reference.

2. The value of public-private partnership

Step change was built on a good working relationship with a charity (We are what we do - responsible for Historypin), and a commercial vendor (Axiell). The rationale behind their involvement was that for Linked Data use to become widespread in libraries, archives and museums, it should be made available through the trusted suppliers upon which professionals have come to depend. Good will on both sides and in both cases enabled the team to overcome serious problems with enforced development staff absences. These challenges do point to a potential over-dependency on a relatively small number of experts able to combine knowledge of RDF technologies with knowledge of library, archive and museum data and practices.

The Axiell experience demonstrated, through the focus group and demo at the national CALM user group, and perhaps unexpectedly, that there is substantial interest from the archive community for Linked Data tools and understanding of their utility.

Next steps: Axiell is releasing the embedded Alicat markup tool in CALM version 9.3 and has agreed to further iterations and improvements in future releases. Crucially, these will be timetabled in response to user feedback. Similar partnerships ought to be explored with other software suppliers, such as Adlib, and a meeting is planned with the UK Adlib user community and representatives from Adlib with this in mind.

3. Technical limitations of APIs

Considerable staff time needed to be set aside for dealing with poor quality responses to queries and trying to fine-tune services. Service reliability is essential if Linked Data approaches are to work. Local firewalls and authentication protocols were a significant obstacle, as was persuading local IT to address these concerns. Change requests for an experimental Linked Data project involving archive catalogues were understandably deemed to be low priority. They also carried a cost implication that needs to be factored into budgets.

Next steps: the cost implications of technical implementation need to be quantified and documentation published to provide institutional IT with context to make informed technical decisions - and persuade managers to authorise expenditure.

4. Value of co-operation

Step change sought to build a number of professional relationships to help leverage goodwill and kickstart a more strategic appreciation of the types of datasets that ought to be output as RDF. So far, datasets have mainly been confined to the library and museum sectors and have been created in an ad hoc way by interested experts, rather than with end users in mind. Discussions were held with The National Archives with a view to using the National Register of Archives dataset as a prototype name authority service. This, and other heavily used TNA services such as the Manorial Documents Register, would prove particularly valuable to the types of local authority archives participating in Step change, with their focus on local history. Test data relating to women in the NRA was released via TNA Labs through Talis' Kasabi service. The withdrawal of support for the service at very short notice provides a salutary lesson that the availability of commercial services cannot be taken for granted. The National Archives is currently renewing its backend systems and will review the status of the NRA, MDR, Archon and other databases in due course.

Discussions were held with other interested parties, not least those representing geographical data. Testing is due to commence with historical placenames relating to Cumbria, supplied as part of the JISC DEEP project concerning the English Placenames Survey, with a view to correctly locating and mapping catalogues.

As part of the CALM development work, a set of configuration instructions was published by Axiell to enable archivists to execute XSLT transforms and link to other services as they become available. The British Museum collections were identified as a good contender with which to test out these instructions, on account of the high quality data that they provide and the mutual political benefit: local institutions can demonstrate a link back to a major national collection held in London, and the BM can demonstrate that museum objects of local significance are being accessed by local people in an intelligent 'Linked Datery' way (for example mapping archaeological finds in the collection and linking with local catalogues or historical society publications). Work on testing this approach is still ongoing and conclusions will be presented in a future post.

Next steps: more cross-sectoral cooperation and scoping is required to think strategically about the kinds of datasets that different audiences need as Linked Data - archivists and different types of users, such as schools, the general public, genealogists, academics and researchers. Large national datasets that could benefit from unlocking include the Clergy of the Church of England Database, British History Online and the Victoria County History. Testing is due to begin with DEEP data and is ongoing with BM data.