Friday, 23 March 2012

UKAD Conference

I gave a talk with Robert Baxter from Cumbria Archive Service at the annual UKAD conference at The National Archives on 21 March. The talk explored the potential of Linked Data in the archives, libraries and museums sector, focusing on the experience of Step change.

Key lessons/challenges from Step change and from other JISC projects King's College Archives are working on (Trenches to Triples and World War One Research) mentioned in the talk include:

  • Defining/setting up/maintaining APIs - this is potentially challenging and time-consuming
  • Need for URI definitions/syntax across the archives, libraries and museums sector - this discussion was started by LOCAH and is ongoing. A wiki will be launched soon by ULCC to invite feedback from information professionals on these definitions and to try to reach some consensus in the coming months
  • Place name vocabularies are a particular challenge. Step change archivists will potentially have access to some four or five sets of data about similar places - for example Geonames, English Place Names, AIM25-UKAT, GoGeo, and a local CALM place dataset. How will they ensure consistency, or that terms found across the datasets actually refer to the same place?
  • Linked Data analysis exposes poor quality and inconsistent existing metadata. Step change is partly about providing tools that will identify discrepancies and make metadata input more consistent but the funding and management challenges of this laundry operation remain considerable
  • Establishing and supporting new live LOD services beyond existing JISC funding will be a challenge. Services go down - how will data retrieval cope with this fact of life?
  • Visualisation - this poses multiple challenges. How much information is too much information for users? How do we maintain relevancy - can the users decide themselves to some extent?
Development of the workflow tool (Alicat) and the LOD version of UKAT is well under way (Workpackages 2-3). Both are informing the redesign currently under way at Axiell. A meeting is planned on 29 March to review CALM development to date, before Robert begins analysing Cumbria test catalogues with the new tools in a CALM development environment.

Friday, 2 March 2012

Progress report

The Step Change project is well under way and there is lots to report.

Rory at ULCC has been working hard on creating a SKOS version of AIM25-UKAT, rolled out as a service. The redesign of the workflow tool, by which archivists can interrogate and analyse finding aids using semantic tools such as Open Calais, is well under way. Careful note was made of the findings on the design and its usability, both from the professional survey panels convened to look at the provisional tool as part of the OMP project in 2011 and from professionals present at a meeting convened by Jane Stevenson at JISC on Linked Data and archives on 7 February, at which the first design of the tool was showcased to a wider audience by Rory. Feedback from that meeting included the need for better faceting of results, faster processing speeds, and more relevant choices available to archivists to validate the processed entities ('This is Winston Churchill, not the Churchill tank').

The current redesign is aimed at producing a cleaner, streamlined tool for processing not only ISAD(G) records, but also more detailed and granular catalogue entries, down to single lines of image metadata, a refinement required as part of the related JISC-funded Trenches to Triples project that allows for semantic processing of digital asset management system metadata. A design meeting is scheduled with CALM for March to refine the adaptations to the CALM user interface necessary to incorporate the workflow tool. A working version of the tool and redesigned CALM system will be road tested at Cumbria and with London members of the CALM User Group once the initial design phase is completed. Chris Hilton of the CALM User Group is helping with this evaluation.

A data exchange schema has been drawn up by ULCC and CALM and a preliminary design document circulated to steering panel members. CALM backend and front end redesign work has begun.

Considerable progress has been made with Historypin to enable placenames held in AIM25-UKAT and their corresponding collection descriptions to be displayed in a modified tab in Historypin corresponding to a broad neighbourhood such as a parish or similarly sized administrative unit. This should provide additional contextual AIM25 catalogue information to users of Historypin, and vice versa, once the service goes live ('Interested in these historical photographs? To learn more about parallel collections that may be of use, click this tab for archive/record office descriptions'). A similar read-across will be possible for the Cumbria instance of CALM, to demonstrate the value for both archives and Historypin of sharing data. Feedback from record office users (often a different audience from university archives) will determine the utility of this approach in the local setting.

This phase of the work posed a variety of challenges familiar to projects using geo-data. Latitude and longitude information needed to be generated from the placenames in order to utilise the Google Maps API used by Historypin. Place name information in AIM25-UKAT was often too broad, or too specific, to be meaningful when translated into Historypin. For example, a collection indexed with the term 'London' but actually concerning papers about Wandsworth would resolve to a point near Charing Cross in Google Maps - misleading for Historypin users expecting to find related information in the Wandsworth part of the map. This example highlights the discrepancy between the granularity of indexing, which is intended for collection level only and is therefore necessarily and intentionally broader, and the granularity necessary for adequate geo-location.
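One way to picture the granularity problem is as a simple check run before a term is passed to Historypin. The sketch below is illustrative only: the level names, the acceptable range and the PlaceTerm structure are assumptions made for this example, not fields that exist in AIM25-UKAT or Historypin.

```python
from dataclasses import dataclass

# Hypothetical granularity scale, broadest first. These level names are
# assumptions for illustration, not values taken from AIM25-UKAT.
LEVELS = ["country", "region", "city", "district", "parish", "street", "building"]

# Assumed range of levels that map sensibly onto a neighbourhood-sized tab.
BROADEST_PINNABLE = "district"
NARROWEST_PINNABLE = "parish"


@dataclass(frozen=True)
class PlaceTerm:
    label: str
    level: str  # must be one of LEVELS


def granularity_check(term: PlaceTerm) -> str:
    """Return 'ok', 'too broad' or 'too specific' for a place index term."""
    rank = LEVELS.index(term.level)
    if rank < LEVELS.index(BROADEST_PINNABLE):
        return "too broad"
    if rank > LEVELS.index(NARROWEST_PINNABLE):
        return "too specific"
    return "ok"


def terms_needing_review(terms):
    """List the labels an archivist would need to look at before pinning."""
    return [t.label for t in terms if granularity_check(t) != "ok"]


index_terms = [
    PlaceTerm("London", "city"),
    PlaceTerm("Wandsworth", "district"),
    PlaceTerm("Oxford Street", "street"),
]
flagged = terms_needing_review(index_terms)
```

Here 'London' would be flagged as too broad and 'Oxford Street' as too specific, leaving only 'Wandsworth' to be pinned without review. A real check would need levels agreed with the data providers.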

Historypin require very accurate scope and content information about a place and another problem to emerge was 'mixed' collection level scope and content descriptions containing references to papers about widely dispersed geographical locations. This is often the case with records reflecting lengthy and varied careers, including those of military officers posted around the globe or scientists on botanical or other expeditions. In these cases, each scope and content paragraph might read across and be pinned to sometimes wildly divergent parts of the globe. A user excited by the 'other useful information' tab for images pinned on Historypin to Oxford Street, say, might start reading a paragraph of catalogue information beginning with a description of an expedition to Borneo, and only later going on to describe Oxford Street. One possible solution for this problem is to allow archivists to highlight, select and save components of scope and content paragraphs and corresponding placenames in the index, so only the appropriate information - and only this information - is displayed in the 'other useful information' tab. This brings its own data complications, however, particularly of storage, retrieval and update of catalogue information. The project team is currently exploring work-arounds and solutions to these data accuracy problems.
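The 'highlight, select and save' idea can be sketched as a small data structure. This is a minimal illustration under stated assumptions: the class, its field names and the sample description are hypothetical and do not reflect how AIM25 or CALM store catalogue data.

```python
from dataclasses import dataclass, field


@dataclass
class CollectionDescription:
    reference: str
    scope_and_content: str
    # Maps a place index term to the (start, end) character ranges that an
    # archivist has selected for it. Hypothetical storage model.
    selections: dict = field(default_factory=dict)

    def select(self, place: str, passage: str) -> None:
        """Record that a passage of the description is about a given place."""
        start = self.scope_and_content.find(passage)
        if start == -1:
            raise ValueError("passage not found in scope and content")
        self.selections.setdefault(place, []).append((start, start + len(passage)))

    def excerpts_for(self, place: str) -> list:
        """Return only the passages saved against this place."""
        return [self.scope_and_content[s:e] for s, e in self.selections.get(place, [])]


desc = CollectionDescription(
    reference="EXAMPLE/1",  # invented reference code
    scope_and_content=(
        "Papers relating to an expedition to Borneo. "
        "Letters concerning a shop on Oxford Street."
    ),
)
desc.select("Oxford Street", "Letters concerning a shop on Oxford Street.")
oxford_street_tab = desc.excerpts_for("Oxford Street")
```

Storing character offsets also shows why update is awkward: if the description is later edited, the saved ranges no longer line up with the text and would have to be re-validated.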

Overlap with other projects

In January, JISC awarded a substantial grant to the DEEP project, based at the Department of Digital Humanities at King's College London but involving input from several universities. This project - Digital Exposure of English Place Names (http://www.jisc.ac.uk/media/documents/programmes/digitisation/econtent/econtent11_13/englishplacenamesprojectplan.pdf) - will publish a Linked Data version of the English Place-Name Society's corpus on a county by county basis, and generate a rich, historical hierarchy of names to complement services such as Vision of Britain. Step change is exploring the possibility of using relevant Cumbria/London place name data to enhance the accuracy of placename indexing via the workflow tool. An archivist would be able to interrogate the new database and select a more historically accurate and appropriate term, with its accompanying URI, for the catalogue entry they are working on, such as the title deeds of an individual property. The archivist might also have a range of alternatives to draw on - a detailed and locally-specific placename list maintained in CALM, a Geonames alternative and the AIM25-UKAT placenames index, for example. Potential pitfalls here include the danger of inappropriately mixed data points (the same place might be described in very different ways across the datasets, or the same name in two or more sets might actually refer to different places), the use of local variants and nicknames, not to mention licensing concerns for component placename lists. The user would also not necessarily notice any improvement in the front end catalogue site unless the places and their URIs are actually connected to real services delivering some added functionality.
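The second pitfall, the same name referring to different places in different datasets, can be illustrated with a simple coordinate comparison. The sketch below is an assumption-laden example: the entries, coordinates and the 5 km threshold are invented for illustration, and a real reconciliation would also have to handle datasets that carry no coordinates at all.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PlaceEntry:
    source: str  # name of the dataset the entry came from
    name: str
    lat: float
    lon: float


def distance_km(a: PlaceEntry, b: PlaceEntry) -> float:
    """Great-circle distance between two entries (haversine formula)."""
    radius = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(a.lat), math.radians(b.lat)
    dphi = phi2 - phi1
    dlambda = math.radians(b.lon - a.lon)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(h))


def same_name_conflicts(entries, max_km=5.0):
    """Pairs that share a name but lie further apart than max_km."""
    conflicts = []
    for i, a in enumerate(entries):
        for b in entries[i + 1:]:
            if a.name.lower() == b.name.lower() and distance_km(a, b) > max_km:
                conflicts.append((a.source, b.source, a.name))
    return conflicts


# Invented entries: coordinates are illustrative, not real dataset values.
entries = [
    PlaceEntry("local CALM list", "Newton", 54.900, -2.900),
    PlaceEntry("Geonames", "Newton", 54.901, -2.901),
    PlaceEntry("AIM25-UKAT", "Newton", 51.500, -0.100),
]
conflicts = same_name_conflicts(entries)
```

In this invented example the local and Geonames entries agree, while the third entry is flagged against both as a probable different place of the same name. The opposite problem, one place under different names, needs variant-name lists rather than coordinates.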

Step change has met with the JISC-funded M25 Search25 service, which is looking to create Linked Data bibliographic tools useful to London's research libraries. We explored possible avenues of collaboration involving mixing bibliographic information with archive data in London ('View these descriptions about Winston Churchill...view these book titles'). Discussion centred around using RDF versions of LCSH and MARC records, as demonstrated by the recent BL Talis project. Ideas include mixing contact/repository information such as ARCHON with library equivalents, and subject-specific read-across for sub-sets of books and archives. Discussions have taken place with TNA on data exchange and experimentation using TNA datasets.

Step change overlaps with Trenches to Triples (http://www.jisc.ac.uk/whatwedo/programmes/di_informationandlibraries/resourcediscovery/trenchestotriples.aspx), a new JISC-funded project being managed by several members of the Step change team. T3, which runs until the end of July 2012, will include the adaptation of the Step change workflow tool to enable the analysis of detailed catalogue entries and the publication of the semantic output as RDF; the creation of an API for the catalogues of the Liddell Hart Centre for Military Archives; and a link between First World War-related metadata and images from the JISC-funded Serving Soldier project and catalogue entries, to provide a granular read-across between different hierarchical representations of the same collections: collection level, detailed file level, and item/piece level from the image metadata. The project will also involve the creation of an enriched corpus of World War One terminology for insertion in UKAT and available across JISC's suite of Great War projects via the Step change AIM25-UKAT API.

Broader discussions are under way between archive, library and museum professionals on a URI definition directory to enable a cross-sectoral Linked Data data model and minimise duplication of effort. A wiki will be created to communicate any outcomes to developers and members of the wider international LODLAM community.

Tuesday, 10 January 2012

Step change project plan

Aims, Objectives and Final Output(s) of the project

The Step change project will develop a web service to make available a Linked Data version of AIM25-UKAT, the most up to date version of the UK Archival Thesaurus, which also includes personal name, corporate name and place indexing. It will also build on the Open Metadata Pathway exemplar, which developed a workflow tool to enable archivists to add Linked Data to catalogue entries. These improvements will be integrated with the CALM cataloguing software product, tested by a leading regional record office and rolled out as an improvement in future software releases. It will also be integrated into Historypin, the historical referencing tool.

The main objective of the project is the creation of a practical, usable service to add Linked Data to archive catalogue material via the creation of an API and its application in a real world archival setting.

The main outputs will be:

  • An RDFa version of UKAT available via an API
  • Enhanced semantic markup workflow tool that can be integrated into other services
  • Enhanced AIM25 website with examples of linking with other services from collection level descriptions
  • Redesigned CALM and CALMView UI to enable archivists to create Linked Data, express it in the front end and connect with relevant external services
  • Exemplar of CALM improvements roadtested in Cumbria Archives Service with linking to relevant local sources and user testing
  • Linking between catalogue entries in AIM25 and CALM with Historypin images
  • External content such as maps and bibliographic records will be connected to archive catalogues to multiply possible research opportunities
  • Final report and lessons learned
Wider Benefits to Sector & Achievements for Host Institution

The project will deliver significant benefits to the HE Archives and wider archives sector. These will include the rejuvenation of UKAT as a useful and up-to-date subject thesaurus via the RDFa version of the index and the API. Step change also builds on the workflow tool, arguably the most important development to come out of the OMP project, which proved its worth in user testing by archive professionals. The tool allows catalogue records to be validated against external services and RDFa to be generated. It also permits enhancement of UKAT by trusted users via appropriate authentication.

These tools and lessons will be immediately applied to CALM, which is the most widely used archival software product in the UK with some 400 institutional customers, many of which are UK universities or research institutions. Archivists who use CALM will see an immediate benefit in an improved backend process to add Linked Data and index their entries, thus speeding up cataloguing and making new collections available to the public more quickly.

Users of such catalogues will also benefit from the visibility of other relevant services selected by archivists in consultation with the users and user interest groups. Historypin users will see enhanced metadata associated with geo-located images pinned on their UK and world map, pointing users directly at the relevant parts of archive catalogues associated with those locations, thus improving accessibility to catalogues via a very popular website and app.

AIM25 will see significant improvements to enhance use of this important aggregation site for London HE and other archives, including improved linking and visibility of associated websites and digital content including external authority records, maps and other content. Cataloguing processes will be streamlined in a similar way to the CALM work by allowing archivists to index their collection level descriptions more quickly and accurately by reference to the definitive UKAT thesaurus in Linked Data format.

Risk Analysis and Success Plan


Risk | Probability | Severity | Score | Actions to prevent / manage risk
---- | ----------- | -------- | ----- | --------------------------------
Difficulty in recruiting and retaining staff | 1 | 3 | 3 | Most staff are already employed by partners and this time will be bought out. The project will also distribute knowledge throughout the project to limit the effects if a staff member leaves. Project gaps will be filled by the use of agency staff or internal secondments and consideration will be given to outsourcing aspects of the technical work.
A complete test bed and evaluation cannot be implemented within the time frame | 2 | 2 | 4 | Project management team will closely monitor progress of objectives and outputs. If necessary, with the agreement of the Programme Manager, some activities can be re-scoped to ensure effective outputs are achieved. Active and regular communication with the archival community and third party service suppliers is recognised as of key importance and will be offered through regular briefings, news items and contribution to lists.
Failure to meet project milestones | 2 | 3 | 6 | Produce project plan with clear objectives. Continuous project assessment and close communication between project manager, technical leads, and JISC programme manager to ensure targets are realistic, achievable and focus on project goals.
CALM-AXIELL goes into receivership | 2 | 2 | 4 | Comparable work will be discussed with Adlib, the second largest supplier of archival cataloguing software in the UK. Failure of this approach will be followed by the development of an application for ICA ATOM.

IPR

IPR in all reports and other documents produced by the project will be retained jointly by King’s College London and ULCC but made freely available on a non-exclusive licence as required/advised by JISC. All software and data created during the project will be made available to the community on an appropriate Creative Commons open licence.

Project Team Relationships and End User Engagement

The project manager is Geoff Browell, Archives Services Manager at King's College London, who was previously responsible for the Open Metadata Pathway exemplar. He is project manager of AIM25, most recently responsible for delivering a major upgrade with the London Metropolitan Archives. Geoff has been responsible for other JISC, Wellcome Trust and similarly funded projects involving cataloguing, digitisation, digital asset management and app development.

The chief technical developer is Rory McNicholl of the University of London Computer Centre (ULCC), where he has worked for some ten years. He developed cataloguing and querying tools for NDAD, and has made substantial contributions to JISC projects including SNEEP, CLASM, MERLIN, PICT and the SOAS Furer-Haimendorf Digital Collections. He works extensively with complex bibliographic and semantic metadata and was lead developer in the OMP project in 2011.

Projected Timeline, Workplan & Overall Project Methodology

Workpackage 1: Project management, planning and recruitment
Creation of the team through secondment; preparation of the detailed project plan; establishment of the project board; creation and maintenance of the project website and blog; communications with the professional community and third party suppliers; focus group evaluation; and budget management.

Workpackage 2: UKAT Service Development
This package will develop UKAT as a web service for AIM25, CALM, Historypin and the archive and MLA sector as a whole. ULCC will develop a set of services for accessing (and manipulating) the UKAT content and make them available via a RESTful web API as a nationally supported service. The API will concentrate on the 'Read' element of the CRUD operations. There will be mechanisms for direct access to records and navigation across the thesaurus structure based on the current SKOS schema (http://www.w3.org/TR/skos-reference/). The developed API will handle searching of the content, based both on single strings and blocks of text. Responses to read operations will constitute semantically expressed data and will be available as RDF/XML or JSON. The ability to update, create and delete records via the API will also be added. The AIM25 workflow prototype developed in the course of the JISC-funded Open Metadata Pathway will be used to demonstrate the client-side of this functionality.
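To make the 'Read' operations more concrete, the sketch below models direct access to a record, navigation via broader/narrower links, and a search over a block of text, returning JSON that uses SKOS property names. It is a hedged illustration only: the base URI, the concept identifiers, the sample labels and the JSON shape are all assumptions for this example and do not describe the actual AIM25-UKAT API.

```python
import json

# Hypothetical base URI; the real service address is not given in the plan.
BASE = "http://example.org/ukat/concept/"

# A tiny in-memory stand-in for the thesaurus. Labels are invented examples.
THESAURUS = {
    "1": {"prefLabel": "Armed forces", "broader": [], "narrower": ["2"]},
    "2": {"prefLabel": "Military engineering", "broader": ["1"], "narrower": []},
}


def read_concept(concept_id: str) -> dict:
    """Direct access to one record, expressed with SKOS property names."""
    record = THESAURUS[concept_id]
    return {
        "@id": BASE + concept_id,
        "skos:prefLabel": record["prefLabel"],
        "skos:broader": [BASE + b for b in record["broader"]],
        "skos:narrower": [BASE + n for n in record["narrower"]],
    }


def find_in_text(text: str) -> list:
    """Return the ids of concepts whose preferred label occurs in a block of text."""
    lowered = text.lower()
    return sorted(
        cid for cid, rec in THESAURUS.items() if rec["prefLabel"].lower() in lowered
    )


response_body = json.dumps(read_concept("2"), indent=2)
matches = find_in_text("Papers on military engineering in the armed forces, 1914-1918")
```

A client such as the workflow tool could follow the skos:broader and skos:narrower URIs in each response to move up and down the thesaurus without knowing its structure in advance.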

Workpackage 3: Development of AIM25 workflow tool
The workpackage will develop the functionality of the workflow tool and roll it out for all AIM25 partners. It will include drag and drop functionality for archivists to select one or multiple RDF-marked terms and drag them into their chosen record(s); a bulk uploader for multiple records sharing similar metadata (which will speed up the workflow still further); further refinement and deployment of front end features that display semantic information; and a permission/authentication tool to validate brand new index terms and ensure that they meet National Council on Archives (NCA) rules for the construction of names. ULCC developed a prototype tool to do this reformatting of names as part of the OMP. Two professional training sessions will be held at King's College for AIM25 archivists to provide instruction on the value of Linked Data and how to use the new AIM25 workflow tool.

Workpackage 4: Implementing UKAT tool in CALM and CALMView
This package will take the AIM25-UKAT API and OMP/AIM25 tools (packages 2, 3) and implement them in CALM and CALMView (the web front end). It will adapt the CALM UI and CALMView and roll out this improvement across all CALM instances with successive upgrades. It will draw on lessons from the SALDA project which investigated Linked Data & CALM (http://blogs.sussex.ac.uk/salda/2011/02/). The development work will be carried out by CALM with some assistance from ULCC and will incorporate the AIM25-style semantic annotation tool in the UI; permit validation, analysis and selection of metadata against AIM25-UKAT; and express links with semantic services in the front end product. For CALMView, work will embed the semantic properties of stored terms into an archival record's web view using RDFa and/or creation of a SPARQL endpoint for records; and the integration of FLISM-like plugins to link record views to related services (FLISM was developed by ULCC to express semantic metadata, see http://code.google.com/p/flism/). This work will enable the public to click on appropriate semantic links embedded in catalogue search returns. The work will draw on tools used in the MERLIN project undertaken by UCL and ULCC that provided an interface to the UNESCO thesaurus in the context of the HILT project.
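As an illustration of what embedding the semantic properties of a stored term in a record's web view might involve, the sketch below wraps an index term in RDFa attributes. It is not CALMView code: the function name, the example URI and the choice of markup are assumptions for this example, although the RDFa attributes used (prefix, about, typeof, property) and the SKOS namespace are standard.

```python
from html import escape

SKOS_NS = "http://www.w3.org/2004/02/skos/core#"


def rdfa_term(uri: str, label: str) -> str:
    """Wrap an index term in RDFa so the page states that the URI
    identifies a skos:Concept with the given preferred label."""
    safe_uri = escape(uri, quote=True)
    return (
        f'<span prefix="skos: {SKOS_NS}" about="{safe_uri}" typeof="skos:Concept">'
        f'<span property="skos:prefLabel">{escape(label)}</span>'
        "</span>"
    )


# Hypothetical concept URI and label, for illustration only.
snippet = rdfa_term("http://example.org/ukat/concept/2", "Arts & crafts")
```

The label still displays as ordinary text to a human reader, while an RDFa-aware client can extract the concept URI and follow it to related services.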

Workpackage 5: In-service implementation of the CALM upgrade including AIM25-UKAT in Cumbria Archive Service
This package will develop, refine, implement and test the changes to the CALM UI and CALMView front end in a leading CALM institution and provide a demonstrator for further review. Cumbria Archive Service will provide detailed input based on real use of CALM by experienced cataloguers, will process as RDF a major sub-section of records (estate, family and local records) comprising 100,000 records, and add semantic markup to a subset of 2,000 of these record entries using the new mark-up tool. The records will then be linked to a number of existing or proposed services such as Wikipedia and Historypin and published live on the CAS catalogue website as an exemplar.

CAS will also lead testing of the new UI and front end at focus groups of the 14-institution North West CALM User Group, including Greater Manchester County Record Office, Liverpool Record Office and Lancashire Record Office, and at a focus group of the Friends of Cumbria Archives, a volunteer forum established in 1991 to support the work of CAS. Representatives of the Cumbria County History Trust (Victoria County History, or VCH, for Cumbria) will also be consulted in order to determine how VCH data might link semantically with CAS local data. These meetings and discussions will inform a final customisation, visual improvement and snagging of the new CALM product. The CAS archivist will also play a key advocacy role throughout, attending key meetings, reporting back to the national CALM User Group and presenting findings at the roadmap consultation.

Workpackage 6: Integration of AIM25-UKAT API with Historypin
This will integrate the AIM25-UKAT API with Historypin to enable links to catalogues containing places that have been marked up semantically via the API. An additional links tab will be added by the Historypin developers to sit alongside descriptions visible in the Historypin map interface to point users back to specific catalogue descriptions. Archive institutions can already bulk upload visual content to Historypin but the AIM25-UKAT API will facilitate automated linking and improve discovery. This work will provide a building block for new crowdsourcing tools currently being developed by Historypin (drawing on projects such as JISC Old Weather, http://www.oldweather.org/) to connect archive, library and museum resources with communities of users, enabling users to mark up and share content and stories about places.

Workpackage 7: Roadmap consultation
This workpackage is a roadmap consultation, scheduled for May/June 2012 in London, which will bring together archive and broader MLA sector practitioners including AIM25 members, CALM users, Historypin, MIMAS, JISC's Discovery, IHR Digital, The National Archives, Vision of Britain, UKAD members and others; representatives of semantic search engines such as Open Calais; and potential content service providers such as Victoria County History and British History Online. It will provide an opportunity to explore and prioritise the creation of suitable services that can be connected by the tools developed in Step change and similar projects. Representation will be sought from InforM25, which has expressed an interest in sharing bibliographic data on books in London, and with which AIM25 already has links; and with the museum sector including Royal Institution Spectrum-compliant museum collections, to examine the value of mixing archival, museum and bibliographic data.

Workpackage 8: Dissemination
This package will run throughout the course of the project and comprise two AIM25 partner evaluation sessions; focus groups for Friends of Cumbria Archives and the North West CALM User Group; papers to the UK Archives Discovery Network 2012 spring conference; the Higher Education Archivists Group, ARA and M25 Library Consortium meetings. The main dissemination event will be the roadmap meeting in May/June 2012. Key staff will attend JISC programme meetings as required and the project manager will maintain a blog and news on public lists.






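WP  | Nov11 | Dec11 | Jan12 | Feb12 | Mar12 | Apr12 | May12 | Jun12 | Jul12
--- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | ----- | -----
WP1 | X     | X     | X     | X     | X     | X     | X     | X     | X
WP2 | X     | X     |       |       |       |       |       |       |
WP3 | X     | X     | X     | X     |       |       |       |       |
WP4 |       |       | X     | X     | X     | X     |       |       |
WP5 |       |       | X     | X     | X     | X     | X     | X     | X
WP6 |       |       | X     | X     | X     |       |       |       |
WP7 |       |       |       |       |       |       | X     | X     |
WP8 | X     | X     | X     | X     | X     | X     | X     | X     | X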



Budget

Monday, 12 December 2011

Footsteps

This post will examine the steps that other practitioners might need to take to exploit the potential of Linked Data, based on the experiences of the OMP project team.
 
Developer liaison: The focus group stage of the project was particularly valuable for bringing technical support together with busy archivists in a workshop setting to understand how semantic markup might be incorporated into archival workflow and best practice. The project has highlighted once again that successful development depends on a high level understanding of archival principles by technical developers, facilitated through this kind of hands-on information exchange. Advice: Developers must have an appreciation of how archive catalogues are compiled by archivists and used by a variety of audiences to successfully embed Linked Data in normal business activity.
 
Front-end development: careful thought needs to be given to what adaptations need to be made to archival websites to express Linked Data entities and the connections they make with external data sources to get full value out of semantic markup. Advice: Institutional IT and web support need to be made aware of the value of Linked Data and the challenges in potential redesign of websites to express these new relationships.
 
Data quality: Semantic markup exposes the deficiencies in existing data and sufficient archival staff time must be set aside to handle the inevitable audit, cleansing and editing required of catalogue and index data. Linked Data approaches can streamline workflows but are not a magic solution - knowledge of collections, context and provenance remains central to the work of the archivist. Advice: Time must be built into any programme to bring archivists up to speed with Linked Data and give them the opportunity to undertake mark-up.
 
Resources needed: the primary resources required are staff training and awareness of Linked Data and access to mark-up tools necessary to add Linked Data to catalogues in a seamless way. These tools should be freely accessible and intuitive to minimise the requirement for extensive (and expensive) training. Access to UKAT or to similar appropriate thesauri is advisable for RDF versions of subject, personal, corporate and placenames to be added to entries with minimum referral to external vocabularies (the 'research' phase of writing or editing catalogues). CALM and other software providers are currently developing embeds for these tools for their UK customers. Time taken for mark-up will differ according to the quality and length of the existing entry and the granularity of indexing but between 6 and 10 page-length collection level descriptions might reasonably be processed in an hour. Advice: key resources are staffing, training and IT. The potential of Linked Data provides a powerful test case for improved access to put to cataloguing funders and boost opportunity for acquiring extra cataloguing resources.
 
Prioritisation: Linked Data implementation works best when tailored to fit existing cataloguing backlogs and priorities - for example through ranking by intrinsic significance or the potential use of collections. Linked Data should not be an expensive, unrealistic add-on. Linked Data, however, provides the opportunity for enhancement and enrichment through linking out to related collections and sources. The availability or non-availability of these external sources will inevitably result in an adjustment to the markup prioritisation. Advice: follow existing plans closely and embed Linked Data markup where appropriate. Produce a 'showcase' collection(s) to highlight potential to internal and external audiences and funders.
 
Engagement: OMP involved cooperation from archivists within AIM25 in a formal workshop setting, and informally via email lists and face to face meetings. Key engagement partners will necessarily include: fellow archivists (what can be learnt from the experience of other information professionals?); institutional IT support (what resources will be necessary to add RDF and express changes in a public website?); senior management (how much will this cost? what are the benefits to the organisation?); users (what do they want, what do they expect? Will their teaching, learning and research experience be improved?). Advice: archivists should attend training programmes and join listservs that provide training or support on Linked Data.
 
Summary of advice:
  • Think carefully about the added value that Linked Data might bring. For example, speeding up indexing thus making closed collections more readily and speedily accessible. Write this up and quantify using test material from priority collections to provide a real-time example of its value
  • Staff and stakeholder training are a key element: identify training opportunities through JISC and other organisations, conferences and hack-days; training of new staff and cataloguers
  • Use available RDF indexing tools and embed in existing cataloguing practice. Listen out for new tools that are imminent, for example for CALM customers
  • Identify new audiences that can fully realise the potential of Linked Data. These might (indeed, ideally should) differ from existing audiences
  • Share best practice with fellow archivists
  • Collect feedback from users to inform priority list for semantic cataloguing (which data sources would be especially useful to them if connected?)
  • Showcase key collections and generate metrics to demonstrate enhanced take-up