JISC Step change will create Linked Data architecture for the UK archive sector, completing in July 2012. It draws on the lessons of Open Metadata Pathway and brings together King's College London Archives, ULCC, Axiell, Cumbria Archive Service, Historypin and the charity, 'We are what we do'. The project will use data held by AIM25 and focuses on delivering a new UKAT Web service and toolset that will allow archivists to mark up catalogues with triples and other semantic entities.
I was fortunate to attend the biennial Linked Open Data in Libraries, Archives and Museums (LODLAM) summit in July 2015 in Sydney, Australia, and played a very small role on the organising committee. The event showcases useful projects and brings together a disparate community of experts: https://graphcommons.com/graphs/0f874303-97c2-4e53-abc6-83a13a1a2030
What is Linked Data? Linked Data is a way of structuring online and other data to improve its accuracy, visibility and connectedness. The technology has been available for more than a decade and has mainly been used by commercial entities such as publishing and media organisations including the BBC and Reuters. For archives, libraries and museums, Linked Data holds the prospect of providing a richer experience for users, better connectivity between pools of data, new ways of cataloguing collections, and improved access for researchers and the public.
It could, for example, provide the means to unlock research data or mix it with other types of data such as maps, or to search digitised content including books and image files and collection metadata. New, more robust, services are currently being developed by international initiatives such as Europeana which should make its adoption by libraries and archives much easier. There remain many challenges, however, and this conference provided the opportunity to explore these.
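To make the idea of Linked Data more concrete, the sketch below represents a few statements as subject, predicate, object "triples" and follows a link from a collection to the name of its creator. It is an illustration only: the example.org URIs and the names are hypothetical, while the predicate URIs are real Dublin Core and FOAF terms.

```python
# A minimal sketch: Linked Data statements as (subject, predicate, object) triples.
# The example.org URIs and the names are hypothetical; the predicate URIs are
# real Dublin Core and FOAF terms.
EX = "http://example.org/"
DCT = "http://purl.org/dc/terms/"
FOAF = "http://xmlns.com/foaf/0.1/"

TRIPLES = [
    (EX + "collection/1", DCT + "title", "Papers of an example author"),
    (EX + "collection/1", DCT + "creator", EX + "person/1"),
    (EX + "person/1", FOAF + "name", "A. N. Author"),
]


def objects(triples, subject, predicate):
    """Return every object stated for the given subject and predicate."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def to_ntriples(triples):
    """Serialise in N-Triples style: URIs in <>, literals in quotes.

    No escaping of special characters is done, which is fine for these
    sample values but not for real data.
    """
    lines = []
    for s, p, o in triples:
        obj = f"<{o}>" if o.startswith("http") else f'"{o}"'
        lines.append(f"<{s}> <{p}> {obj} .")
    return "\n".join(lines)


# Follow a link: from the collection to its creator, then to the creator's name.
creator = objects(TRIPLES, EX + "collection/1", DCT + "creator")[0]
creator_name = objects(TRIPLES, creator, FOAF + "name")[0]
print(creator_name)
print(to_ntriples(TRIPLES))
```

Because the creator is identified by a URI rather than a text string, any other data set that uses the same URI is automatically connected to this collection.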
The conference comprised a mix of quick fire discussions, parallel breakout sessions, 2-minute introductions to interesting projects, and the Challenge entries.
Quick fire points from delegates
Need for improved visualisation of data (current visualisations are not scalable or require too much IT input for archivists and librarians to realistically use)
Need to build Linked Data creation and editing into vendor systems (the Step change model which we pursued at King’s Archives in a Jisc-funded project)
Exploring where text mining and Natural Language Processing overlap with LOD
World War One Linked Data: what next? (less of a theme this time around as the anniversary has already started)
LOD in archives: a particular challenge? (archives are lagging libraries and galleries in their implementation of Linked Data)
What is the next equivalent of the Getty vocabularies: a popular vocabulary that can encourage use of LOD?
Fedora 4 and LOD in similar open source or proprietary content management systems (how can Linked Data be used with these popular platforms?)
Linked Data is an off-putting term implying a data-centric set of skills (perhaps Linked Open Knowledge as an alternative?)
Building a directory of cultural heritage organisation LOD: how do we find available data sets? (such as Linked Open Vocabularies)
Implementing the Europeana Data Model: next steps (stressing the importance of Europeana in the Linked Data landscape)
Can we connect different entities across different vocabularies to create new knowledge? (a lot of vocabularies have been created, but how do they communicate?)
This talk showcased a new product called OASIS from Synaptica, aimed at art galleries, which facilitates the identification, annotation and linking of parts of images. These elements can be linked semantically and described using externally-managed vocabularies such as the Getty suite of vocabularies or classifications like Iconclass. This helps curators do their job. End users enjoy an enriched appreciation of paintings and other art. It is the latest example of annotation services that overlay useful information and utilise agreed international standards like the Open Annotation Data Model and the IIIF standard for image zoom.
We were shown two examples: Botticelli’s The Birth of Venus and Holbein’s The Ambassadors for impressive zooming of well-known paintings and detailed descriptions of features. Future development will allow for crowdsourcing to identify key elements and utilising image recognition software to find these elements on the Web (‘find all examples of images of dogs in 16th century public works of art embedded in the art but not indexed in available metadata’).
This product mirrors the implementation of IIIF by an international consortium that includes leading US universities, the Bodleian, BL, Wellcome and others. Two services have evolved which offer archives the chance to provide deep zoom and interoperability for their images: Mirador, and the Wellcome’s Universal Viewer (http://showcase.iiif.io/viewer/mirador/). These get around the problem of having to create differently sized derivatives of images for different uses, and of having to publish very large images on the internet when download speeds might be slow.
Digital New Zealand
Chris McDowall of Digital New Zealand explored how best to make LOD work for non-LOD people. Linked Open Data uses a lot of acronyms and assumes a fairly high level of technical knowledge of systems which should not be assumed. This is a particular bugbear of mine, which is why this talk resonated. Chris’ advocacy of cross developer/user meetups also chimed with my own thinking: LOD will never be properly adopted if it is assumed to be the province of ‘techies’. Developers often don’t know what they are developing because they don’t understand the content or its purpose: they are not curators.
He stressed the importance of vocabulary cross-walks and the need for good communication in organisations to make services stable and sustainable. Again, this chimed with my own thinking: much work needs to be done to ‘sell’ the benefits of Linked Data to sceptical senior management. These benefits might include context building around archive collections, gamification of data to encourage re-use, and serendipity searches and prompts which can aid researchers. Linked Data offers truly targeted searching, in contrast to the ‘faith based technology’ of existing search engines (a really memorable expression).
He warned that the infrastructure demands of LOD should not be underestimated, particularly from researchers making a lot of simultaneous queries: he mooted a pared down type of LOD for wider adoption.
Richard Wallis of OCLC explored the potential of Schema.org, a growing vocabulary of high level terms agreed by the main search engines to make content more searchable. Schema.org helps power the search result boxes one sees at the top of Google search results pages. Richard suggested the creation of an extension relevant to archives to add to the one for bibliographic material. The advantage of schema.org is that it can easily be added to web pages, resulting in appreciable improvement in ranking and the possibility of generating user-centred suggestions in search results. For an archive, this might mean a Google user searches for the papers of Winston Churchill and is offered related suggestions, such as booking tickets to a talk about the papers, or viewing Google Maps information showing the opening times and location of the archive.
The group discussion centred on the potential elements (would the extension refer to theses, research data, and university systems that contain archive data, such as finance and student information?), and on the need for use cases and setting out potential benefits. I agreed to be part of an international team, working through the W3C, to help set one up.
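As a rough illustration of how schema.org markup is added to a web page, the sketch below builds a JSON-LD description of an archive and wraps it in the script tag that search engines read. It is not taken from the session: the archive name, address, URL and opening hours are hypothetical, and the general-purpose schema.org type Library is used only as a stand-in, since archive-specific terms were exactly what the group proposed to create.

```python
import json


def archive_jsonld(name, url, street, locality, opening_hours):
    """Describe an archive with general-purpose schema.org terms.

    "Library" is a stand-in type (an assumption for this sketch); the
    values passed in are hypothetical.
    """
    return {
        "@context": "https://schema.org",
        "@type": "Library",
        "name": name,
        "url": url,
        "openingHours": opening_hours,
        "address": {
            "@type": "PostalAddress",
            "streetAddress": street,
            "addressLocality": locality,
        },
    }


def as_script_tag(data):
    """Wrap the JSON-LD in the script element used to embed it in a page."""
    body = json.dumps(data, indent=2)
    return '<script type="application/ld+json">\n' + body + "\n</script>"


data = archive_jsonld(
    name="Example University Archives",
    url="http://example.org/archives",
    street="1 Example Street",
    locality="London",
    opening_hours="Mo-Fr 09:30-17:00",
)
tag = as_script_tag(data)
print(tag)
```

The resulting tag can be pasted into the head of an existing catalogue or visitor information page without changing how the page looks to human readers.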
This Dutch service facilitates the linking of different controlled vocabularies and thesauri and helps address two questions faced by many cultural organisations: ‘which thesauri do I use?’ and ‘how do I avoid reinventing the thesauri wheel?’. The service allows users to upload a SKOS vocabulary, link it with one of four supported vocabularies and visualise the results.
The service helps different types of organisation to connect their vocabularies, for example an audio-visual archive with a museum’s collections. The approach also allows content from one repository to be enhanced or deepened through contextual information from another. The example of Vermeer’s Milkmaid was cited: enhancing the discoverability of information on the painting held in the Rijksmuseum in Amsterdam through connecting the collection data held on the local museum management system with DBpedia and with the Getty Art and Architecture Thesaurus. This sort of approach builds on the prototypes developed in the last few years to align vocabularies (and to ‘Skosify’ data – turn it into Linked Data) around shared Europeana initiatives (see http://semanticweb.cs.vu.nl/amalgame/).
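The simplest form of vocabulary alignment can be sketched as follows: two vocabularies are compared by their normalised preferred labels, and each match is recorded as a skos:exactMatch link. This is a deliberately naive illustration and not a description of how the service itself works; the concept URIs and labels are hypothetical, and only the skos:exactMatch property URI is real.

```python
# Naive label-based alignment of two SKOS-style vocabularies (a sketch only).
# Concept URIs and labels are hypothetical; skos:exactMatch is a real property.
SKOS_EXACT_MATCH = "http://www.w3.org/2004/02/skos/core#exactMatch"

LOCAL_VOCAB = {
    "http://example.org/local/1": "Genre painting",
    "http://example.org/local/2": "Milk  jugs",
    "http://example.org/local/3": "Newsreels",
}

TARGET_VOCAB = {
    "http://example.org/target/a": "genre painting",
    "http://example.org/target/b": "milk jugs",
    "http://example.org/target/c": "still lifes",
}


def normalise(label):
    """Lower-case the label and collapse runs of whitespace."""
    return " ".join(label.lower().split())


def align(local, target):
    """Return (local URI, skos:exactMatch, target URI) for equal labels."""
    index = {normalise(label): uri for uri, label in target.items()}
    links = []
    for uri, label in local.items():
        match = index.get(normalise(label))
        if match is not None:
            links.append((uri, SKOS_EXACT_MATCH, match))
    return links


links = align(LOCAL_VOCAB, TARGET_VOCAB)
unmatched = [u for u in LOCAL_VOCAB if u not in {s for s, _, _ in links}]
for link in links:
    print(link)
print("Needs manual review:", unmatched)
```

In practice, exact label matches are only a starting point: unmatched or ambiguous terms still need a curator's judgement, which is why tools that visualise the proposed links are useful.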
Research Data Services project: Introduction by Ingrid Mason
This is a pan-Australian research data management project focusing on the repackaging of cultural heritage data for academic re-use. Linked Data will be used to describe a ‘meta-collection’ of the country’s cultural data, one that brings together academic users of data and curators. It will utilise the Australia-wide research data nodes for high speed retrieval (https://www.rds.edu.au/project-overview and http://www.intersect.org.au/).
Jon explained how the popular historical mapping service, historypin, is dealing with the problem of ‘roundtripping’ where heritage data is enhanced or augmented through crowdsourcing and returned to its source. This is of particular interest to Europeana, whose data might pass through many hands. It highlights a potential difficulty of LOD: validating the authenticity and quality of data that has been distributed and enriched.
Chris McDowall of Digital New Zealand
Chris explained how to search across different types of data source in New Zealand, for example to match and search for people using phonetic algorithms to generate sound alike suggestions and fuzzy name matching: http://digitalnz.github.io/supplejack/.
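To show what a phonetic algorithm does, here is a minimal sketch of Soundex, one of the oldest sound-alike codes, used to suggest names that sound like the one searched for. It is offered as an illustration of the general technique only: the talk did not specify which algorithms are used, this is not Supplejack code, and the names in the index are invented.

```python
# A sketch of American Soundex, a classic phonetic code for names.
# Assumption: this illustrates the general technique, not Supplejack itself.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundex(name):
    """Return the four-character Soundex code, or "" if name has no letters."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    result = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with the same code
        code = CODES.get(ch, "")  # vowels (and other letters) have no code
        if code and code != prev:
            result += code
        prev = code
        if len(result) == 4:
            break
    return result.ljust(4, "0")


def sound_alike(query, names):
    """Return the names whose Soundex code equals that of the query."""
    target = soundex(query)
    return [n for n in names if soundex(n) == target]


# Hypothetical index of names, for illustration only.
index = ["Robert", "Rupert", "Rubin", "Ashcraft", "Ashcroft"]
print(sound_alike("Robart", index))
```

Codes like this trade precision for recall, so a real service would typically combine them with other fuzzy matching and ranking before presenting suggestions.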
This 6 million Euro EU-funded project aims to make audio-visual material more accessible and has been trialled with thousands of hours of video footage, and expert users, from the BBC. Its purpose is to help users mine vast quantities of audio-visual material in the public domain as accurately and quickly as possible. The team have developed tools using open source frameworks that allow users to detect people, places, events and other entities in speech and images and to annotate and refine these results. This sophisticated tool set utilises face, speech and place recognition to zero-in on precise fragments without the need for accompanying (longhand) metadata. The results are undeniably impressive – with a speedy, clear, interface locating the parts of each video with filtering and similarity options. The main use for the toolset to date is with film studies and journalism students but it unquestionably has wider application.
The Axes website also highlights a number of interesting projects in this field (http://www.axes-project.eu/?page_id=25). Two stand out: Cubrik (http://www.cubrikproject.eu/), another FP7 multinational project which mixes crowd and machine analysis to refine and improve searching of multimedia assets; and the PATHS prototype (http://www.paths-project.eu/), ‘an interactive personalised tour guide through existing digital library collections. The system will offer suggestions about items to look at and assist in their interpretation. Navigation will be based around the metaphor of a path through the collection.’ The project created an API and user interface, and launched a tested exemplar with Europeana to demonstrate the potential of new discovery journeys to open access to already-digitised collections.
The NSW State Library sought to find new ways of visualising their collections by date and geography through their DX Labs, an experimental data laboratory similar to BL Labs, which I have worked with in the UK. One visually arresting visualisation shows the proportions of collections relevant to particular geographical locations in the city of Sydney. Accompanied by approving gasps from the audience, this showed an iceberg graphic superimposed onto a map showing the proportion of collections about a place that had been digitised and the proportion yet to be digitised – a striking way of communicating the fragility of some collections and the work still to be done to make them accessible to the public.
Open Memory Project. This Italian entry won the main prize. It uses Linked Data to re-connect victims of the Holocaust in wartime Italy. The project was thought provoking and moving and has the potential to capture the public imagination.
Polimedia is a service designed to answer questions from the media and journalists by querying multi-media libraries, identifying fragments of speech. It won second prize for its innovative solution to the challenge of searching video archives.
LodView goes LAM is new Italian software designed to make it easier for novices to publish data as Linked Data. A visually beautiful and engaging interface makes this a joy to look at.
EEXCESS is a European project to augment books and other research and teaching materials with contextual information, and to develop sophisticated tools to measure usage. This is an exciting, ambitious, project to assemble different sources using Linked Data to enable a new kind of publication made up of a portfolio of assets.
Preservation Planning Ontology is a proposal for using Linked Data in the planning of digital preservation by archives. It has been developed by Artefactual Systems, the Canadian company behind the AtoM and Archivematica software. This made the shortlist as it is a good example of a ‘behind the scenes’ management use of Linked Data to make preservation workflows easier.
A selection of other entries:
Public Domain City extracts curious images from digitised content. This is similar to BL Labs’ Mechanical Curator, a way of mining digitised books for interesting images and making them available to social media to improve the profile and use of a collection.
Project Mosul uses Linked Data to digitally recreate damaged archaeological heritage from Iraq. A good example of using this technology to protect and recreate heritage damaged in conflict and disaster.
The Muninn Project combines 3D visualisations and printing using Linked Data taken from First World War source material.
LOD Stories is a way of creating story maps between different pots of data about art and visualising the results. The project is a good example of the need to make Linked Data more appealing and useful, in this case by building ‘family trees’ of information about subjects to create picture narratives.
Get your coins out of your pocket is a Linked Data engine about Roman coinage and the stories it has to tell – geographically and temporally. The project uses nodegoat as an engine for volunteers to map useful information: http://nodegoat.net/.
Graphity is a Danish project to improve access to historical Danish digitised newspapers and to enhance them with maps and other content using Linked Data.
Dutch Ships and Sailors brings together multiple historical data sources and uses Linked Data to make them searchable.
Corbicula is a way of automating the extraction of data from collection management systems and publishing it as Linked Data.
Day two sessions
Day two sessions focused on the future. A key session led by Richard Wallis explained how Google is moving from a page ranking approach to a triple confidence assertion approach to generating search results. The way in which Google generates its results will therefore move closer to the LOD method of attributing significance to results.
Need for a vendor manifesto to encourage systems vendors, such as Ex Libris, to build LOD into their systems (Corey Harper of New York University proposed this and is working closely with Ex Libris to bring this about)
Depositing APIs/documentation for maximum re-use (APIs are often a weak link – adoption of LOD won’t happen if services break or are unreliable)
Uses identified (mining digitised newspaper archives was cited)
Potential piggy-backing from Big Pharma investment in Big Data (massive investment by drugs companies to crunch huge quantities of data – how far can the heritage sector utilise even a fraction of that?)
Need to validate LOD: the quality issue – need for an assertion testing service (LOD won’t be used if its quality is questionable. Do curators (traditional guardians of quality) manage this?)
Training in Linked Data needs to be addressed
Need to encourage fundraising and make LOD sustainable: what are we going to do with LOD in the next ten years? (Will the test of the success of Linked Open Data be if the term drops out of use when we are all doing it without noticing? Will 5 Star Linked Data be realised? http://5stardata.info/)
There were several key learning points from this conference:
The divide between technical experts and policy and decision makers remains significant: more work is needed to provide use cases and examples of improved efficiencies or innovative public engagement opportunities that the technology provides
The re-use and publication of Linked Data is becoming important and this brings challenges in terms of IPR, reliability of APIs and quality of data
Easy-to-use tools and widgets will help spread the use of Linked Data, avoiding complicated and unsustainable technical solutions that depend on project funding
Working with vendors to incorporate Linked Data tools in library and archive systems will speed its adoption
The Linked Data community ought to work towards the day Linked Data is business as usual and the term goes out of use