Monday, 12 December 2011

Footsteps

This post will examine the steps that other practitioners might need to take to exploit the potential of Linked Data, based on the experiences of the OMP project team.
 
Developer liaison: The focus group stage of the project was particularly valuable for bringing technical support together with busy archivists in a workshop setting to understand how semantic markup might be incorporated into archival workflow and best practice. The project has highlighted once again that successful development depends on a high-level understanding of archival principles by technical developers, facilitated through this kind of hands-on information exchange. Advice: Developers must have an appreciation of how archive catalogues are compiled by archivists and used by a variety of audiences to successfully embed Linked Data in normal business activity.
 
Front-end development: careful thought needs to be given to what adaptations need to be made to archival websites to express Linked Data entities and the connections they make with external data sources to get full value out of semantic markup. Advice: Institutional IT and web support need to be made aware of the value of Linked Data and the challenges in potential redesign of websites to express these new relationships.
 
Data quality. Semantic markup exposes the deficiencies in existing data and sufficient archival staff time must be set aside to handle inevitable audit, cleansing and editing required of catalogue and index data. Linked Data approaches can streamline workflows but are not a magic solution - knowledge of collections, context and provenance remain central to the work of the archivist. Advice: Time must be built into any programme to bring archivists up to speed with Linked Data and give them the opportunity to undertake mark-up.
 
Resources needed: the primary resources required are staff training and awareness of Linked Data and access to the mark-up tools necessary to add Linked Data to catalogues in a seamless way. These tools should be freely accessible and intuitive to minimise the requirement for extensive (and expensive) training. Access to UKAT or to similar appropriate thesauri is advisable so that RDF versions of subject terms and personal, corporate and place names can be added to entries with minimum referral to external vocabularies (the 'research' phase of writing or editing catalogues). CALM and other software providers are currently developing embeds for these tools for their UK customers. Time taken for mark-up will differ according to the quality and length of the existing entry and the granularity of indexing, but between 6 and 10 page-length collection level descriptions might reasonably be processed in an hour. Advice: key resources are staffing, training and IT. The potential of Linked Data provides a powerful test case for improved access to put to cataloguing funders and boost opportunity for acquiring extra cataloguing resources.
 
Prioritisation: Linked Data implementation works best when tailored to fit existing cataloguing backlogs and priorities - for example through ranking by intrinsic significance or the potential use of collections. Linked Data should not be an expensive, unrealistic add-on. Linked Data, however, provides the opportunity for enhancement and enrichment through linking out to related collections and sources. The availability or non-availability of these external sources will inevitably result in an adjustment to the markup prioritisation. Advice: follow existing plans closely and embed Linked Data markup where appropriate. Produce a 'showcase' collection(s) to highlight potential to internal and external audiences and funders.
 
Engagement: OMP involved cooperation from archivists within AIM25 in a formal workshop setting, and informally via email lists and face to face meetings. Key engagement partners will necessarily include: fellow archivists (what can be learnt from the experience of other information professionals?); institutional IT support (what resources will be necessary to add RDF and express changes in a public website?); senior management (how much will this cost? what are the benefits to the organisation?); users (what do they want, what do they expect? Will their teaching, learning and research experience be improved?). Advice: archivists should attend training programmes and join listservs that provide training or support on Linked Data.
 
Summary of advice:
  • Think carefully about the added value that Linked Data might bring. For example, speeding up indexing thus making closed collections more readily and speedily accessible. Write this up and quantify using test material from priority collections to provide a real-time example of its value
  • Staff and stakeholder training are a key element: identify training opportunities through JISC and other organisations, conferences and hack-days; training of new staff and cataloguers
  • Use available RDF indexing tools and embed in existing cataloguing practice. Listen out for new tools that are imminent, for example for CALM customers
  • Identify new audiences that can fully realise the potential of Linked Data. These might (indeed, ideally should) differ from existing audiences
  • Share best practice with fellow archivists
  • Collect feedback from users to inform priority list for semantic cataloguing (which data sources would be especially useful to them if connected?)
  • Showcase key collections and generate metrics to demonstrate enhanced take-up

Monday, 19 September 2011

Next steps

Previous posts have highlighted some of the benefits of the project.

Key findings include:

  • An approach to adding supplementary, semantic, metadata that builds on the work practices of archivists rather than imposing a new layer of work
  • Semantic analysis can supplement existing indexing, making its creation faster, more accurate and more consistent
  • Existing natural language processing services such as Open Calais still lack archival-LMA related vocabularies, so have some way to go before they 'learn' the language of libraries, museums, archives, galleries and the research environment
  • Opportunities for enhancing discoverability with roll-over features and drop downs to make Linked Data of mainstream value to users and not the preserve of a minority of technical experts
Next steps
  • Further refinement of the editing interface to make mark-up clearer and with features such as bulk analysis of ISAD(G) records
  • Further testing of sample data across key natural language processing services to ask certain questions - what are they missing? - how accurate are they? - how could they be improved and what role would JISC have in this?
  • More testing of front end delivery pages in AIM25 using external data, particularly geographical and name authority data via projects such as Linking Lives
  • Moving beyond AIM25 to link archive descriptions with other services via an API. Linked Data is only useful if users can connect and aggregate, mix and mash
  • Using services to mix archival, bibliographic and museums content information, and to cross-walk with broader content used by wider audiences (such as genealogical data, metadata for digitised collections - eg newspapers, mapping or crowdsourced data) 
  • Surveying which services need to be created with existing database records, that can then be linked together. This should be user and service-driven with input from major national institutions such as the TNA and BL, as well as JISC

Monday, 5 September 2011

Some assessment about the scalability and robustness of the processing against OpenCalais

The title of this post is a question from our evaluator David Kay that I somewhat failed to answer in my last post so here goes.

Scalability of OpenCalais use (some numbers)

We are using the Open Calais web service (http://www.opencalais.com/documentation/calais-web-service-api) on a free basis. OpenCalais allows for 50,000 transactions per day per user on this basis. Anyone using the OMP workflow prototype will be regarded as the same user. When running the analysis in the workflow each ISAD(G) element that is selected for analysis represents a single openCalais transaction.

There are currently 23 text boxes that can be selected for analysis. However, during focus group meetings with the AIM25 archivists it was suggested that only 2 or 3 of these would be likely to yield potentially meaningful analysis.

In the last 9 or so years AIM25 has accrued records for 15,335 collections. So some back of envelope maths would tell us:

Rate of record addition: 15,335 / 9yrs that's roughly 4.5 records/day

Even in the unlikely event that someone analysed all 23 text-areas, that would be just over 100 transactions per day. So there is some wriggle room in the 50,000 limit for edits, re-analysis, etc.

Of course the reality of how archivists use AIM25 is probably not very well represented by those numbers. The potential constraint placed on archivist throughput due to openCalais analysis (ie max records that could be analysed in a day) would be about 2,174. I'll let others who are more qualified to comment on whether there has ever been, or may be in the future, a throughput that exceeds that.
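The back-of-envelope figures above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: it assumes exactly 9 years of 365 days (the post says "9 or so"), which gives a rate a little under 4.7 records a day rather than the rounded 4.5 quoted, and it rounds the throughput ceiling to the nearest record as the post does. The function names are illustrative and not part of the prototype.

```python
# Figures quoted in the post.
DAILY_LIMIT = 50_000   # free OpenCalais transactions per day per user
TEXT_BOXES = 23        # ISAD(G) text boxes that can be selected for analysis
COLLECTIONS = 15_335   # AIM25 collection records accrued so far
YEARS = 9              # "9 or so years" (assumed exact for this sketch)


def records_per_day(collections: int, years: float) -> float:
    """Average number of records added per calendar day."""
    return collections / (years * 365)


def transactions_per_day(rate: float, boxes: int) -> float:
    """Transactions per day if every record has `boxes` elements analysed."""
    return rate * boxes


def max_records_per_day(limit: int, boxes: int) -> int:
    """Records analysable per day under the limit, rounded to the nearest record."""
    return round(limit / boxes)


rate = records_per_day(COLLECTIONS, YEARS)
print(f"records/day: {rate:.2f}")
print(f"transactions/day (all boxes): {transactions_per_day(rate, TEXT_BOXES):.0f}")
print(f"max records/day (all boxes): {max_records_per_day(DAILY_LIMIT, TEXT_BOXES)}")
```

If archivists analyse only 2 or 3 elements per record, as the focus group suggested, the same functions show the ceiling rising well beyond the all-boxes figure.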

Robustness of processing against OpenCalais (some churn)

The analysis does take some time and, as one would expect, the time taken increases with the number of elements being analysed and the length of the text blocks sent for analysis. The prototype leaves the annotation and processing of the result RDF to the JavaScript element of the AJAX process. This means that there is reliance on the performance of the client machine, which is an unknown.

For the largest block of text in the current system (56,544 characters) the browser-side processing did become mired in "Unresponsive script" messages. The request to openCalais was not the thing causing the lag, so the culprit was the browser-side processing of the result. All this would suggest that more of this post-response processing should be pushed over to the server.

A move to more server-side processing would also improve extensibility of the framework. Server-side brokerage of results from a range of services would allow for a more consistent response both for AIM25 workflow and for any potential third party clients.
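One way to picture server-side brokerage is a single function that calls each analysis service and merges the results into one consistent list. The sketch below is an assumption about how such a broker might look, not a description of the prototype (which is written in PHP and jQuery): the `broker` function, the shape of the term dictionaries and the idea of passing services in as callables are all hypothetical.

```python
from typing import Any, Callable, Dict, List

# A term is assumed to look like {"label": "London", "type": "place"}.
Term = Dict[str, Any]
Analyser = Callable[[str], List[Term]]


def broker(text: str, analysers: Dict[str, Analyser]) -> List[Term]:
    """Send `text` to every analyser and merge duplicate terms.

    Terms are treated as duplicates when label (case-insensitive) and
    type agree; each merged term records which services suggested it.
    """
    merged: Dict[tuple, Term] = {}
    for name, analyse in analysers.items():
        try:
            results = analyse(text)
        except Exception:
            # One slow or failing service should not break the response.
            continue
        for term in results:
            key = (term["label"].lower(), term["type"])
            entry = merged.setdefault(
                key, {"label": term["label"], "type": term["type"], "sources": []}
            )
            entry["sources"].append(name)
    # Stable ordering so every client sees the same response.
    return sorted(merged.values(), key=lambda t: (t["type"], t["label"].lower()))
```

Because the merging happens in one place, the AIM25 workflow and any third party client would receive the same structure regardless of which services were consulted.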

Friday, 2 September 2011

The OMP prototype ISAD(G) editor - How it works

This post describes the prototype workflow tool, how it can be used for recording AIM25 collection level records, and some associated technical issues.

General aims of prototype

  1. To improve on the usability of the existing offering
  2. To make use of semantic annotation within the tool
  3. To use linked data to enhance the user experience of the AIM25 website

Detailed aims

Aim 1.

  • a) Eliminate the need for archivists to use mark-up
  • b) Integrate the indexing process with the metadata recording process
  • c) Reduce page scrolling and generally improve usability

Aim 2.

  • a) Analyse the textual input against existing authoritative sources both external and internal
  • b) Suggest and record indexing terms derived from the analysis
  • c) Record the semantic properties of terms

Aim 3

  • a) Mark up the semantic properties of indexed terms both within the ISAD(G) display and within the “Access points” lists
  • b) Provide links to related services based on the semantic properties of the terms
Technological details

Aim1

For the prototype we took a snapshot of the AIM25 database and put this on the OMP test server (http://data.aim25.ac.uk).

An alternative workflow for adding and editing collection level records was built in PHP using this snapshot as a data source. The JavaScript framework jQuery was also used to control on-screen actions and provide a bit of dynamism. The ISAD(G) elements are grouped into the areas described by documents like this http://is.gd/GEXdG2, with each area given a tab.

[Screenshots not reproduced: the existing data entry form becomes the tabbed prototype form.]


The access points are displayed on the right. Terms are colour coded according to the four term types:
  • person
  • place
  • organisation
  • concept
Each term classified under these types is represented by the OMP data-browser, which parses RDF that is in turn derived from the existing database (more on that later). Archivists can remove access points by dragging and dropping terms into the dustbin icon.

Aim2

The prototype workflow uses one external service (OpenCalais) directly to analyse text and suggest useful terms for indexing. AIM25's existing index, which includes the UK Archival Thesaurus, is also interrogated. As a result an AIM25 text analysis service was developed.

For rapid development this service runs boring old SQL on the existing AIM25 data tables, but as there is already a mechanism to transpose this data into RDF (and more) a more robust semantic solution is theoretically a short hop away.
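To give a flavour of what "boring old SQL" term matching can look like, here is a minimal sketch using Python's built-in sqlite3 module. The table name `terms`, its columns and the sample rows are assumptions made for illustration; the real AIM25 tables and queries are not shown in this post. The match is a naive case-insensitive substring test, so it would also find "weaving" inside "interweaving".

```python
import sqlite3


def build_index(rows):
    """Create an in-memory stand-in for the index table (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE terms (label TEXT, type TEXT)")
    conn.executemany("INSERT INTO terms VALUES (?, ?)", rows)
    return conn


def exact_matches(conn, text):
    """Return (label, type) for every index term that appears in `text`."""
    sql = (
        "SELECT label, type FROM terms "
        "WHERE instr(lower(?), lower(label)) > 0 "
        "ORDER BY label"
    )
    return conn.execute(sql, (text,)).fetchall()


conn = build_index([
    ("Weaving", "concept"),
    ("London", "place"),
    ("Poetry", "concept"),
])
print(exact_matches(conn, "Records of weaving mills in London"))
```

A semantic version of the same lookup would query the RDF derived from these tables instead, but the input (a block of text) and the output (a list of typed terms) could stay the same.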

Archivists can use check-boxes by the side of each textarea in the workflow form to select ISAD(G) elements for analysis. The selected text is sent for analysis by one or both of the services and results are displayed in two ways.
  • As embedded mark-up in the "textarea"
  • As term lists in the Analysis/Indexing area

Above is an example of a list of terms returned from the AIM25 service. The term "Weaving" is in the process of being added as an access point for this record.


Here we see the same results embedded in the text. These are a smaller set as they only include the exact matches. When saved, terms not added to the access points are stripped out. Those that remain can be represented in context as RDFa.
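As an illustration of representing a retained term in context as RDFa, the sketch below wraps each whole-word occurrence of a term in a span carrying `about` and `typeof` attributes. It is an assumption about the output shape rather than the prototype's actual mark-up: the function name `to_rdfa` is hypothetical, and the `skos:` prefix is assumed to be declared elsewhere in the page.

```python
import html
import re


def to_rdfa(text, term, uri, rdf_type="skos:Concept"):
    """Wrap whole-word occurrences of `term` in an RDFa span.

    The surrounding text is HTML-escaped so the result can be dropped
    into a page as-is.
    """
    pattern = re.compile(r"\b%s\b" % re.escape(term), re.IGNORECASE)
    parts, last = [], 0
    for match in pattern.finditer(text):
        parts.append(html.escape(text[last:match.start()]))
        parts.append(
            '<span about="%s" typeof="%s">%s</span>'
            % (html.escape(uri, quote=True), rdf_type, html.escape(match.group(0)))
        )
        last = match.end()
    parts.append(html.escape(text[last:]))
    return "".join(parts)


print(to_rdfa("Poetry & prose", "Poetry",
              "http://www.ukat.org.uk/thesaurus/concept/525"))
```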


Here the results returned by OpenCalais are embedded and below they are displayed as a list so that they can be added to the access points. Also below are the results of a direct lookup on the AIM25 service so that archivists can add access points for terms that do not appear in the text.



Did we achieve any of aim3? More to follow on this soon...

Monday, 22 August 2011

Costs & Benefits

A key question arose throughout the project, not least in the two archivists' focus groups - is Linked Data worth the input of professional staff time? From the front end perspective, are the improvements for users - enriched catalogues published more speedily and improved, automated, linking with external services - sufficient to justify the extra effort required from staff? Aren't Google, Autonomy/HP and other large corporations that manage huge quantities of data doing enough already, or won't they be very soon (a Google 'Linked Data' button, anyone?)? A fundamental point is that archivists are under enormous pressure to justify and quantify potential benefits of Linked Data to senior management through the simplification of often confusing and obscure terminology and by the use of exemplars and online test areas.

The OMP project showed that initial professional scepticism can be overcome if Linked Data can be simply defined and the benefits clearly set out. Archivists will use Linked Data if a service or services are provided that automate or simplify mark-up or the semantic process more generally and embed it within existing cataloguing workstreams. Ideally, these can be built out of trusted aggregations, authorities or cataloguing systems such as the Hub, AIM25, TNA, CALM or ATOM. They are less likely to use Linked Data if it is perceived to be a complex, though potentially useful, add-on requiring detailed specialist knowledge and delivered without support or guidance ('build it and they will come'). The ability to retroconvert legacy catalogues and CLDs with Linked Data through automation against OpenCalais and other engines will help sell Linked Data more effectively, as can validation of metadata created out of mass digitisations and OCR.

The OMP project has underlined the value of Linked Data in a number of ways:

  • Increased access and discovery
  • Increased use and return on investment in cataloguing (speeding up cataloguing, enabling tools that require an archivist to locate and link information - for example indexing, finding already-existing authority records and linking to them; finding suitable subject terms; locating places from geonames or similar)
  • Enhanced ability to justify expenditure on services and resource development (improved web-hits and connecting with heavily used services)
  • Exposure of information to novel and different uses (Combining ALM collections for the delivery of services, including commercial services - apps, exhibitions, mapping, new tools etc)
The specific benefits, as demonstrated via AIM25 are:

Updated workflow interface including:
  • Reduction of the requirement for archivists to input HTML
  • Reduction of the on-screen size of the form
  • Integration of the process of selecting access points
  • Automatic semantic annotation to aid selection of classifying terms
  • Authority lookup (internal and external - UKAT, GeoNames, etc) to improve rigour of metadata
Semantic rendering of the classification terms used by AIM25 (separate from the AIM25 access-points records):
  • SKOS representation of AIM25-UKAT data
  • RDF for AIM25 people, families and corporate names
  • GeoNames representation of AIM25 place data
Use of RDFa where available to enhance the public interface of AIM25
  • Semantic lookup allowing users to further explore definitions and instances of terms based on the properties defined during the workflow process.

The main business case is two-fold: adding value and boosting efficiency. Archivists are very attracted to the idea of enabling UKAT in Linked Data but as an active service like OpenCalais, not a look-up. AIM25 has developed a SKOS version of UKAT and a workflow tool that would link from a revised AIM25 data entry template to a LD UKAT.

Of place, personal name, corporate name and subject, subject terms are arguably the most subjective, requiring the archivist to exercise judgment on the preferred term with the collection and potential users in mind. OMP has shown that subject terms throw up the least accurate semantic returns from a linguistic analysis service such as OpenCalais (places can often be matched with absolute precision, as can personal names). OMP has improved professional efficiency by developing a hover tool to enable the archivist to select a preferred subject term from UKAT or via connecting to LD versions of LCSH/NRA and to add this term or terms to their new catalogue/CLD.

Without such automation, Linked Data won't be embedded or the data linked will be limited in scope. Flexibility is key. Focus group archivists concluded that they need the ability to analyse as much or as little of a description as they need, and to reach that faceting decision as speedily as possible - selecting the most important entities that require linking in any body of text, and fields (just 'creator', 'institution' etc or terms within Scope and Content or Admin/Biographical?). The value of broader authority data was reiterated by the archivists - analysis should not be limited to Scope and Content. A fundamental point is that back-end Linked Data enhancement works best when it works with the grain of professional practice - pragmatically and speedily.

The OMP approach is innovative in that it offers further exposure of data - and all AIM25 data has been processed as part of the project. Sustainability will be maintained going forward either by periodic manual data dumps into OpenCalais or by automated calendared refreshes - the same approach could be envisaged for LD UKAT as a national service plugged into local systems such as CALM. Improving the OpenCalais vocabulary by importing archive-specific terms is crucial to the success of mark-up. Analysis of the catalogue data is only valuable if OpenCalais learns from archivists. Until this happens, the breadth of vocabulary will limit the scope of the mark-up. It is also worth putting pressure on the main suppliers of archival cataloguing software to encourage them to embed support in periodic upgrades.

Experimentation with NRA data is ongoing - this will test how difficult it would be to build an authorities service off the NRA/ARCHON. The results will be described in a separate blog post.


Thursday, 28 July 2011

Licensing

Linked data, amongst the many challenges it presents, requires licensing which is appropriate to its intended uses. As the compilation of databases is not regarded as a creative act under at least US law, the Creative Commons licence is probably not appropriate for licensing linked data. Instead, the Open Data Commons licences (http://opendatacommons.org/licenses/) defined by the Open Knowledge Foundation appear a more appropriate choice for this purpose.

Open Data Commons includes three licences: the Public Domain Dedication and License (PDDL), which places the data in the public domain and waives all rights, the Attribution License (ODC-By) which allows the sharing and adaptation of the data provided it remains attributed, and the Open Database License (ODbL), which allows the same rights provided any adaptations are distributed under the same licence.

Links to these licences are provided as RDF triples on the website: for instance:-
    <rdf:RDF
        xmlns:cc='http://creativecommons.org/ns#'
        xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'
        xmlns:dcq='http://purl.org/dc/terms/'>
      <cc:License rdf:about="http://opendatacommons.org/licenses/odbl/1.0/">
        <cc:legalcode
            rdf:resource="http://opendatacommons.org/licenses/odbl/1.0/"/>
        <dcq:hasVersion>1.0</dcq:hasVersion>
      </cc:License>
    </rdf:RDF>

to define the Open Database License. They can therefore be readily incorporated into any linked data application.  Discussions with partner institutions within AIM25 will tease out the preferred licence. It is expected to be ODbL.
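As a rough illustration of how an application might pick up the licence, the sketch below reads a licence description with Python's standard XML parser and returns the licence URI. It assumes the description is served as well-formed RDF/XML in the shape shown in the string constant; the function name `licence_uris` is hypothetical, and a real linked data application would more likely use a dedicated RDF library.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
CC_NS = "http://creativecommons.org/ns#"

# Assumed well-formed version of the ODbL licence description.
LICENCE_RDF = """<rdf:RDF
    xmlns:cc='http://creativecommons.org/ns#'
    xmlns:rdf='http://www.w3.org/1999/02/22-rdf-syntax-ns#'
    xmlns:dcq='http://purl.org/dc/terms/'>
  <cc:License rdf:about="http://opendatacommons.org/licenses/odbl/1.0/">
    <cc:legalcode rdf:resource="http://opendatacommons.org/licenses/odbl/1.0/"/>
    <dcq:hasVersion>1.0</dcq:hasVersion>
  </cc:License>
</rdf:RDF>"""


def licence_uris(rdf_xml):
    """Return the rdf:about URI of every cc:License element in the document."""
    root = ET.fromstring(rdf_xml)
    return [
        element.get("{%s}about" % RDF_NS)
        for element in root.iter("{%s}License" % CC_NS)
    ]


print(licence_uris(LICENCE_RDF))
```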

As a matter of record a statement is being added to all descriptions reflecting AIM25 as the origin of the data, that AIM25 is a partnership in which contributing members have rights, but that the data is otherwise freely accessible for reuse as it has been since inception some 11 years ago (see for example the quantities of AIM25 created entries for the Women’s Library now also offered in the Hub). The statement will reference the use of an Open Data Commons licence.

An issue still to be addressed however is thesaurus support. UKAT is based on the UNESCO Thesaurus; UNESCO gave permission for development but we assume further permission will be required for further use and embedding. The same applies to MeSH, and to the Gay and Lesbian and other vocabularies developed by partner organisations. We have also used the Getty Art & Architecture Thesaurus selectively.

Gareth Knight and Patricia Methven

Wednesday, 27 July 2011

Second Archivists' Focus Group

A second archivists' focus group was convened on Wednesday 27th July. In attendance were representatives of Senate House Library, Wandsworth Heritage Services, London School of Economics, ULCC, King's College London, London Metropolitan Archives, the Institute of Education and the National Archives.

The group reviewed progress since the last meeting and viewed a demo of the new back end editing area of AIM25. The back end area allows one, several or all ISAD(G) fields in collection level descriptions to be analysed in three ways: against the existing AIM25 version of UKAT, against UKAT using a fuzzy match option to identify synonyms, and against OpenCalais using the OpenCalais service. The returns are listed alphabetically alongside the record in collapsible lists, and the entities are also highlighted in the text using colour coding to distinguish subjects, place names, personal names and corporate names, and triples where those were identified. Analysis normally takes a matter of seconds, though timing-out can occur with longer records.

A mouse roll-over feature has been inserted for each term in the text, allowing users to identify the particular attribute of the term in a tick box drop down menu, using a connection to one or more external services including GeoNames (is 'London' the capital of the UK, a place in Canada, an author or part of a corporate name?). Triples can also be interrogated in roll-overs in order that the editing archivist might validate or clarify these entities. These choices can be saved and then exported. The enhanced content will be expressed in a new front-end delivery for the test records that demonstrates linking with external services, in order to enhance the user experience by pulling together reliable external information on a place, name, subject etc relevant to that collection.

The debate centred on how the editing process can be speeded up - for example by 'signing-off' the capital-of-the-UK version of London for all examples of 'London' across all ISAD(G) fields after review of the first instance ('treat all subsequent examples of 'London' in this record as the capital'). Linking with the NRA is desirable to identify authority terms set out in NCA format. Linking with Library of Congress was raised as an important deliverable in order to maximise the opportunity for synergy between archive and library descriptions, particularly in local authority record offices.
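The 'sign-off' idea can be sketched as a function that applies one reviewed choice to every unresolved mention of the same term in a record. This is an illustration of the proposal discussed by the group, not an existing feature: the mention dictionaries, the `sign_off` name and the example.org URI are all hypothetical.

```python
def sign_off(mentions, label, uri):
    """Return a copy of `mentions` with unresolved mentions of `label` set to `uri`.

    Mentions that already carry a URI are left alone, so an archivist's
    earlier, more specific choices are not overwritten.
    """
    resolved = []
    for mention in mentions:
        mention = dict(mention)  # do not modify the caller's data
        if mention["label"].lower() == label.lower() and mention.get("uri") is None:
            mention["uri"] = uri
        resolved.append(mention)
    return resolved


record = [
    {"field": "Scope and Content", "label": "London", "uri": None},
    {"field": "Admin/Biographical", "label": "London", "uri": None},
    {"field": "Scope and Content", "label": "Paris", "uri": None},
]
for mention in sign_off(record, "London", "http://example.org/place/london-uk"):
    print(mention)
```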

The question of updating UKAT in an ongoing fashion was raised - maintaining an RDF version of AIM25 UKAT must require minimum ongoing effort given constrained budgets and workloads. Analysis reveals the limitations of the existing thesaurus but also the possibility and desirability for external services like OpenCalais to be enhanced by input from ALM thesauri and vocabularies. This requires a conversation between JISC or key UK institutions and OpenCalais and similar services.

Next steps are to improve and tidy the editing area (for example by changing colour coding), plugging this into the front-end for test records and exploring NRA/ARCHON collaboration.

Monday, 18 July 2011

Archivists' focus group

King's College Archives recently hosted a focus group comprising leading London archivists familiar with using AIM25. The purpose of the focus group was to understand how Linked Data approaches might speed up the behind-the-scenes editing work of the archivist and improve the front-end user experience. Representatives of Senate House Library, the London School of Economics, Wandsworth Heritage Services, the London Metropolitan Archives, the British Postal Museum, the Institute of Education and University of London Computer Centre were in attendance. Development work on new AIM25 records was showcased.

Real-time use of OpenCalais was demonstrated and tested by members using sample data and the results compared. Subject-term creation was shown to be an area of potential concern - OpenCalais was developed by Reuters as a news and current affairs-support service and terms tend to reflect this focus. More input from archive vocabularies was called for to enable OpenCalais's corpus to be enriched with Higher Education and other terminology. It was also suggested that Linked Data could provide fuzzy matching between formal if rather arcane UNESCO-style subject terms and terms that are in more popular use, to encourage discovery and take-up. It was suggested that the UK Archival Thesaurus could be enhanced and made available in a SKOS version.

The practical use to hard-pressed archivists came up time and again as a topic of conversation. Most archivists have neither the time nor budgets to engage in experimentation but need practical tools that they can plug into their work without fuss. Quantifying the benefits of Linked Data is vital to sell the approach to funders and institutional management. Cross-domain services are an important attraction in surfacing and linking archive information with books and museum content. The benefits of linking to Wikipedia services (DBpedia) were raised - Wikipedia lies at the centre of the Linked Data universe. Biographical content could be imported wholesale from other sources and adapted for use in a particular record, which would save time researching and writing one from scratch.

The plans of proprietary suppliers like Axiell and Adlib were raised as an issue - are they planning to incorporate Linked Data tools in future versions of their archive management software? The role of Google was discussed. Do they have any Linked Data plans and if not, why not?

The issue was raised of which fields in ISAD(G) to include in Linked Data work. It was argued that focusing only on Scope and Content was a mistake, not least because of the value of authority records (Admin/Biographical) and related records fields. Linking to the NRA to surface related collections was discussed.

Indexing was discussed by panel members. Editors of AIM25, the Archives Hub or similar tools should be able to draw on Linked Data to improve or enhance the personal, corporate and place names of new and existing records (and the ability to retrospectively run existing records through OpenCalais was flagged as an important requirement - archivists are more likely to embrace LD if they can painlessly re-index their current content). Linked Data provides the opportunity for more automation  and speedier indexing, which are particularly useful for smaller archives without cataloguing expertise.

Next steps included developing the indexing tools further in order to compare workflow with traditional methods, building a prototype front-end delivery system to enhance collection level descriptions, and engaging in conversation with Google and others to identify best practice.

Tuesday, 10 May 2011

URIs for AIM25 access metadata

We've been giving some thought to a suitable URI scheme to adopt for this project which could mesh with the current AIM25 metadata requirements.

The current AIM25 system contains four types of access records: personal names (Person), corporate names (Organisation), subject (Subject) and geographic names (Place). To allow semantic linkages to be formed, we shall need a coherent set of URIs that can handle all of these.

For Subjects, one option may be the UKAT thesaurus which is available as SKOS RDF. Each concept here has already been assigned a URI: for instance, for 'Poetry' this is http://www.ukat.org.uk/thesaurus/concept/525. Note that UKAT is no longer being edited, however, and so may become out-of-date in the future.

For place names, there are several gazetteers available: the Archives Hub recommends the Getty Thesaurus (http://www.getty.edu/research/conducting_research/vocabularies/tgn/index.html), but there is not yet a set of URIs authorised by the Getty (although they are working on this).

The LOCAH (Linked Open Copac Archives Hub - http://blogs.ukoln.ac.uk/locah/) project produced a set of guidelines for URIs which look very useful. For each of these categories they would take this form:-


Personal and corporate names

LOCAH recommends the following format for personal names:-

{root}/id/person/{rules}/{person-name}

so, once we have decided on a suitable root - let's say for the moment http://data.aim25.ac.uk/, for Burns | John | 1774-1868 | surgeon, we'd have

http://data.aim25.ac.uk/id/person/ncarules/burnsjohn1774-1868surgeon.

Similarly for corporate names, we'd have:-

http://data.aim25.ac.uk/concept/organisation/nra/stthomasshospitallondon

For geographic names, we'd have:-

http://data.aim25.ac.uk/id/place/ncarules/grimsbylincolnshireta2709


and finally for subjects:-

http://data.aim25.ac.uk/id/concept/aim25subjects/medicalsciences


These seem to be viable options, although a final decision has yet to be reached.
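The pattern behind these example URIs can be sketched in a few lines of Python. This is a guess at the normalisation implied by the examples (lower-case the name and drop everything except letters, digits and hyphens), not a rule taken from the LOCAH guidelines, and the function names are hypothetical. Note that it would silently drop accented letters, which a real scheme would need to handle.

```python
import re

ROOT = "http://data.aim25.ac.uk"  # provisional root used in the examples


def slug(name):
    """Lower-case `name` and keep only a-z, 0-9 and hyphens."""
    return re.sub(r"[^a-z0-9-]", "", name.lower())


def make_uri(kind, rules, name, root=ROOT):
    """Build a URI of the form {root}/id/{kind}/{rules}/{slug-of-name}."""
    return "%s/id/%s/%s/%s" % (root, kind, rules, slug(name))


print(make_uri("person", "ncarules", "Burns | John | 1774-1868 | surgeon"))
print(make_uri("concept", "aim25subjects", "Medical sciences"))
```

Because the URI is derived purely from the access point string, two records that index the same name in the same form would arrive at the same URI without any central registry, at the cost of treating variant spellings as different entities.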

Friday, 6 May 2011

A machine readable layer for AIM25

I've been busy on the AIM25 test server adding a machine readable layer.

Of course AIM25 has for a long time offered the metadata held for each collection as EAD. I've taken this as my jumping off point and had a go at adding some more formats to the AIM25 arsenal that will hopefully be of use to any silicon based users of the AIM25 service.

There are still a few screws to tighten but hopefully this work will represent a useful tool for the OMP work to enrich the AIM25 metadata.

---

First off I wanted to mimic the browsing structure that us humans take for granted as we make our way from the homepage to collection page on the website. For this we used Encoded Archival Context (EAC) to list and describe the Institutions and their Collections.

Next we wanted to extend the work started by Gareth and Richard using semantic web services at the collection level. Once at this level we can access EAD (as we always could). To this we added the dynamically generated output from OpenCalais (OCS) and a modified version of EAD with the OCS output embedded in the content (EAD aft. OCS). The latter is also dynamically generated.

Lastly we added some browser-side scripting to the original HTML pages to highlight terms identified by OpenCalais. All of the above calls OpenCalais dynamically, so be patient. Obviously the goal would be to use a triple store populated from OpenCalais output at the point of creation (and change) of records.

This work so far is really a demonstration of some possible ways of expressing and enriching AIM25 content. It is by no means an exhaustive (or even authoritative) list of possible formats, but we hope it will serve to make tangible some of the ideas we've been discussing over the past month or so.

Thanks to our firewall-wallahs you can now browse AIM25-OMP here (thankfully in HTML too).

If nothing else this has been a good exercise in getting to know AIM25 a bit better and whipping my XSLT into a useful shape and of course dipping my toe in the semantic ocean.

Thursday, 24 March 2011

Semantic Analysis of AIM25 EAD


Rory and I met with Richard Gartner and Gareth Knight at CeRch today, to catch up with their investigations into using GATE and OpenCalais to process the EAD outputs from AIM25.

Results look very encouraging. OpenCalais, in particular, generates as its output a set of identified entities (personal names, place names, corporate names). Richard G has created regular expressions to locate these in the body of the EAD and wrap them in appropriate EAD tags (<persname> etc).
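As an illustration of the wrapping step (not Richard's actual expressions, which we haven't reproduced here), a rough Python sketch might look like this. The entity list is a hypothetical stand-in for whatever OpenCalais returns, and the type-to-tag mapping is an assumption.

```python
import re

# Assumed mapping from entity type to EAD element name.
TAGS = {"Person": "persname", "Organisation": "corpname", "Place": "geogname"}


def wrap_entities(text, entities):
    """Wrap each known entity name found in text in its EAD tag.

    entities is a list of (name, type) pairs. A single combined pattern is
    used so that no name is wrapped twice, and longer names are tried first
    so that a full name wins over a shorter name contained within it.
    """
    if not entities:
        return text
    tag_for = {name: TAGS[kind] for name, kind in entities}
    names = sorted(tag_for, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")

    def replace(match):
        tag = tag_for[match.group(1)]
        return f"<{tag}>{match.group(1)}</{tag}>"

    return pattern.sub(replace, text)


sample = "John Burns trained at St Thomas's Hospital in London."
found = [
    ("John Burns", "Person"),
    ("St Thomas's Hospital", "Organisation"),
    ("London", "Place"),
]
marked_up = wrap_entities(sample, found)
print(marked_up)
```

This sketch works on plain text only. Text that already contains markup, or names that appear inside other tagged names, would need more careful handling, which is one reason the output should go back to the submitter for checking.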

This suggests that the way forward for enhancing the existing data entry processes for AIM25 will involve dispatching the EAD-compliant data entered by collections managers to OpenCalais, and returning the data, with enhanced markup, for checking by the submitter. This hook should be easy enough to insert for manual, form-based entry; for batch entry processes we will need to assess whether any significant delays are introduced.

We've also started to consider ideas for a URI scheme for the entities identified. Our current working hypothesis is that this will involve defining a "data" namespace for AIM25, binding to http://data.aim25.ac.uk/. Within that we can develop a structure along the lines /person, /place, /corporate_body, and append our unique IDs for each entity. Further research is necessary, particularly into the Cabinet Office recommendations for Designing URI Sets for the UK Public Sector.

These URIs can then be used in identifier attributes for our EAD elements (<persname>, etc.), and thence easily transformed into an RDFa format for the Web-based HTML rendering of the AIM25 catalogues.
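A rough sketch of that transformation in Python follows (in practice XSLT would be the more natural tool). It assumes the URI is carried in the EAD authfilenumber attribute and that FOAF, Dublin Core and RDFS terms are suitable targets. Both are assumptions for illustration only, and the vocabulary prefixes would still need declaring in the host HTML page.

```python
import xml.etree.ElementTree as ET

# Assumed mapping from EAD element to an RDF class; not an agreed profile.
RDFA_TYPES = {
    "persname": "foaf:Person",
    "corpname": "foaf:Organization",
    "geogname": "dcterms:Location",
}


def ead_to_rdfa(fragment: str) -> str:
    """Turn EAD name elements that carry a URI into RDFa-annotated spans."""
    root = ET.fromstring(fragment)
    for el in root.iter():
        if el.tag in RDFA_TYPES and "authfilenumber" in el.attrib:
            uri = el.attrib.pop("authfilenumber")
            rdf_class = RDFA_TYPES[el.tag]
            el.tag = "span"
            el.set("about", uri)              # the entity being described
            el.set("typeof", rdf_class)       # its class
            el.set("property", "rdfs:label")  # the element text as its label
    return ET.tostring(root, encoding="unicode")


ead = (
    '<p><persname authfilenumber="http://data.aim25.ac.uk/id/person/'
    'ncarules/burnsjohn1774-1868surgeon">Burns, John</persname> '
    "practised in <geogname>London</geogname>.</p>"
)
html = ead_to_rdfa(ead)
print(html)
```

Elements without a URI (the <geogname> here) are left untouched, so untagged or unresolved names simply pass through.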

Next steps include further investigating how to implement and assert relationships between our entities and other open datasets (e.g. our_entity is_the_same_as your_entity), and how to make the authority data, duly marked-up, available as open metadata.
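One common way to assert "is the same as" in Linked Data is the owl:sameAs property. As a small sketch, assuming that property is the one we settle on, a link could be written out as an N-Triples statement like this. The external URI is a made-up placeholder, not a real dataset.

```python
OWL_SAME_AS = "http://www.w3.org/2002/07/owl#sameAs"


def same_as_triple(ours: str, theirs: str) -> str:
    """Return one N-Triples line stating that two URIs name the same thing."""
    return f"<{ours}> <{OWL_SAME_AS}> <{theirs}> ."


triple = same_as_triple(
    "http://data.aim25.ac.uk/id/person/ncarules/burnsjohn1774-1868surgeon",
    "http://example.org/people/john-burns",  # hypothetical external URI
)
print(triple)
```

owl:sameAs is a strong claim (everything said about one URI then holds for the other), so for subject terms a looser SKOS mapping property may turn out to be a better fit.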

Rory and I can now start to consider suitable approaches to embedding this in our development copy of the existing AIM25 system, and we'll continue to liaise closely with CeRch for advice on the relative merits of GATE and OpenCalais processing, and guidance on URI implementation.

Wednesday, 23 March 2011

The challenge of adding new records

Five new institutions were being recruited as part of the project to ensure clean data and a 'level playing field' to test the value of Linked Data. These include the National Maritime Museum, Zoological Society and British Postal Museum. The fragility of archive institutions in the current economic climate has been highlighted by news of the reorganisation of Hammersmith and Fulham Archives Service - an OMP partner - following local authority budget cuts. It is still hoped that their records can be included at a later stage, not least because it will enhance their public profile via internet searching, and thereby encourage more active use of collections by researchers, but in the meantime the Royal Botanic Gardens, Kew, have been recruited as a substitute.

The project also highlights the challenge of adding value to archive services through Linked Data or similar projects in the midst of reorganisation and the roll-out of other projects - the National Maritime Museum Archives, for example, are entering a period of public closure prior to the opening of new reading rooms and other public areas as part of a major investment. New catalogues and an archives/library management system are also being installed and tested in 2011. Participation in OMP appealed to the NMM, providing an opportunity to add rich Linked Data to new audiences as part of a larger, more public, initiative. It also represents a challenge, not least for NMM staff who are being asked to prioritise the records that we are seeking to include in the project (a mix of items that are heavily used by the public, for which they receive many written enquiries or which are underused and for which they hope to improve access), and prepare the EAD in a flavour that can readily be imported into AIM25.

I have visited the Zoo, the British Postal Museum and Wandsworth Heritage Services to examine their systems and records. The latter two use CALM, with which AIM25 is familiar, but the Zoo uses a library system - EOSi - requiring import of records in MARC21. A recurring theme of the OMP and other projects is that for new archive IT projects to be rolled out successfully the active support of busy institutional IT services is often indispensable - to set up export tools, develop database tools and amend websites. This places a brake on project delivery times as, understandably, bespoke work on archive databases might come a long way down an IT service's priority list. It also reflects the need for archivists to understand what databases can do and how data can easily be shared - not least in order to communicate effectively with IT helpdesks and keep senior management on board. This is knowledge that is usually acquired through hands-on experience and trial and error, which in turn highlights the value of the many informal support networks among archivists who turn to each other for advice and guidance on how to make data work harder.

Monday, 7 March 2011

National Archives Discovery Event

There were several interesting presentations on Linked Data at an event hosted by The National Archives on behalf of the National Archives Discovery Network on 2 March. The Network is a forum for aggregation services such as AIM25, SCAN, Genesis and the Hub, mapping services such as Vision of Britain, and major institutions such as the British Library, along with other information specialists. The event attracted more than 100 delegates. Presentations included keynote addresses by Richard Wallis of Talis and John Sheridan of the National Archives; a review of the state of play with EAC by Bill Stockting; and reviews of Linked Data projects carried out on Government data and by the BBC, along with the Hub's LOCAH project.

Other talks included reviews of progress on History Pin, the new Google-led initiative which embeds archive digital content in Google Streetview; updates on recent crowdsourcing projects such as the Bentham initiative; and news on cataloguing software including ICA AToM.

Slideshare reviews of the presentations will be available shortly.  

Project Overview for Programme Startup Meeting

What content and metadata are you working with?

Archival catalogue data; ISAD(G)/EAD; Collection level descriptions; Authority files (People, Organisations, Places, Subjects)


How will this data be made available?


Once the Linked Data research team (CeRch) has established appropriate ontologies, schemas and URI schemes, the Open Metadata will be published in SKOS format (much as UKAT currently is).


What are your use cases for the data?


The use of linked data ontologies within the AIM25 system will provide many opportunities to associate AIM25 records automatically and intelligently with other information resources; and it will allow other information resources to locate and link to archives information in AIM25, enhancing discovery, and supporting the aggregation of AIM25 data into dynamic searches and aggregators across the sector.


The archival authority files that will be published as open data contain a wealth of information of interest and value that could be reused in many ways, in other archive and library systems, as well as in historical, biographical and genealogical contexts. The data could also be extended and enriched in the course of its reuse, and the derived datasets in turn be available for reuse in AIM25.


What benefits to your institution and the sector do you anticipate?

  • Improved discovery/discoverability
  • Improved linking and interoperability with other web resources
  • Improved take-up of authoritative archives data and metadata
  • Assessment of added value of linked/semantic data to online archives and cataloguing

Technical approaches / challenges
  • Agreeing and defining ontologies, schemata
  • Implementing effective tools in short timescale
  • Implementing the FLISM popup menu interface

Sunday, 6 March 2011

The Project Plan

Aims, objectives and final outputs of the project


The Open Metadata Pathway or Pathfinder project will deliver a robustly validated demonstrator of the effectiveness of opening up archival catalogues to widened automated linking and discovery through embedding RDFa metadata in Archives in the M25 area (AIM25) collection level catalogue descriptions. It will also implement as part of the AIM25 system the automated publishing of the system's high quality authority metadata as open datasets. The project will include an assessment of the effectiveness of automated semantic data extraction through natural language processing tools (using GATE) and measure the effectiveness of the approach through statistical analysis and review by key stakeholders (users and archivists). All outputs of the project will be integrated into AIM25 resources and workflows, ensuring the sustainability of the benefits to the community.


Summary objectives



Standards based cataloguing with thesaurus support is both time consuming and constrained by subjective and contemporary views about subject choice and relevance. Use of automated semantic metadata extraction through natural language processing tools and Linked Data offer the possibility of upgraded harvesting and wider and more effective subject searching.

The project will deliver a robustly validated pilot embedding RDFa metadata in AIM25 archival collection catalogues, opening up archival catalogues to widened automated linking and discovery. This will include creation of metadata profiles and URI schemes and an assessment of the effectiveness of automated semantic metadata extraction through natural language processing tools (using GATE). The outputs of the project will be integrated into the AIM25 resources and workflows, ensuring that AIM25 content continues to be available in linked data form.

A large amount of accumulated authority metadata (subject terms, personal and place names, geographical names) exists in AIM25 SQL database tables and is already normalised in appropriate standard forms (e.g. NCA Rules). This is used to provide search and access points to the collection records. The project will reimplement these rich metadata resources as embedded RDFa within the online catalogues, and ensure the resulting datasets are openly available for reuse under appropriate open licensing tools (e.g. ODL, GPL, Creative Commons) – in consultation with the community and the Programme Manager.

For the benefit of the data creators, workflow and input systems will be revised to support new metadata creation techniques, including authority-based. For the benefit of the end user, these key terms in the catalogues will be implemented as clickable hotspots, offering context-specific linking and searching to other systems. Existing features and functionality will not be compromised.

The pilot system will be used to demonstrate and evaluate the effectiveness of reimplementing existing search tools and entry points within the system using SPARQL, as well as creating an API enabling external services to use the same retrieval tools.

Dual input will be undertaken of over 1,140 entries from CALM and AdLib from six partner institutions, including editorial confirmation of ISAD(G) compliance and creation of UKAT and NCA Name Authority files.

Results and outputs will be evaluated at key milestones by a representative panel of archivists from AIM25 members and users to assess the usability and accessibility.

Outputs
·     A working model of an enhanced AIM25 web application, for demonstration and evaluation purposes, to include SPARQL APIs; reimplementation of existing end-user tools for searches, views and queries using RDF query tools and AJAX.
·     AIM25 authority metadata in linked data format, published with an Open Metadata licence, including SKOS implementation of the AIM25 thesaurus data;
·     An AIM25 data profile based on the public schemas and ontologies identified for each domain (e.g. DBpedia) and a URI scheme for entities in the AIM25 namespace;
·     Reimplementation of existing AIM25 data creation tools to include RDFa creation, assisted by natural language processing of catalogues via the GATE service;
·     A  published report detailing ongoing and summative evaluation of the techniques used and final outputs;
·     Dissemination activities for the AIM25 partnership, wider archives and access and discovery communications;
·     Optimised user searching of AIM25.

Wider Benefits to Sector & Achievements for Host Institution


Among the contributions the project will make to the sector and host institution are:
·     Make open metadata about archives held in libraries, museums and archive repositories available through the delivery of an open, running pilot system demonstrating an enhanced version of the AIM25 system featuring embedded RDFa, a SPARQL-based query engine and SPARQL endpoint API. The records in the pilot system will number 1140 (increasing to 16,140 when the project outputs are implemented for the live AIM25 system).
·     Make the rich, validated and reviewed authority datasets of AIM25 available in an open format, under open licensing terms, for reuse by the archival and wider community. These include tens of thousands of entries including thesaurus terms (UKAT based, with local and MeSH additions), and personal, place and corporate names structured to NCA rules.
·     Deliver a detailed account of the process and outcomes of creating and implementing linked data profiles for ISAD(G)/EAD based archival metadata and offer a clear articulation of how established descriptions and authority metadata standards may be delivered and maintained as open metadata
·     Provide a coherent analysis and examples for the archival and wider access and discovery community of the value, effectiveness and potential of the approach to delivery using RDF, in terms of widening access and deepening use, and providing an opportunity to learn how the approach optimises the use of archival staff time.
·     Produce knowledge and practice, including a working model, that enhances and optimises AIM25 and may be of benefit to other institutions holding archives.
·     Deliver optimised user searching tools and techniques for use in the AIM25 system that AIM25 will commit to implementing in its live system as soon as possible after the completion of the project. (A full, live launch across AIM25 has been excluded from the project scope owing to the limited timescale available.)
Risk analysis

Risk | Probability | Severity | Score | Action to prevent / manage risk
Difficulty in recruiting and retaining staff | 1 | 3 | 3 | Most staff are already employed by partners and their time will be bought out. The project will also distribute knowledge throughout the team to limit the effects if a staff member leaves. Given the short duration of the project, gaps will be filled by the use of agency staff or internal secondments.
New partners are unable to supply numbers of descriptions | 2 | 2 | 4 | Utilise new accession material from existing partners. Fall back on existing data.
A complete testbed and evaluation cannot be implemented within the time frame | 2 | 2 | 4 | Project management team will closely monitor progress of objectives and outputs. If necessary, with the agreement of the Programme Manager, some activities can be re-scoped to ensure effective outputs are achieved.
Failure to meet project milestones | 2 | 3 | 6 | Produce project plan with clear objectives. Continuous project assessment and close communication between project manager, technical leads, and JISC programme manager to ensure targets are realistic, achievable and focused on project goals.

IPR

IPR in all reports and other documents produced by the project will be retained jointly by King’s College London and ULCC but made freely available on a non-exclusive licence as required/advised by JISC. All software and data created during the project will be made available to the community on an open licence. We will respect the licence model of all third-party software used during the project, most of which is made available under open source licences.
Project team relationships and end user engagement

The project will be overseen by a board comprising: Patricia Methven, Director of AIM25 (Chair); Kevin Ashley (Director, Digital Curation Centre); Mark Hedges (Deputy Director, CeRch); Geoffrey Browell (Senior Archivist, King’s College Archives Services); Richard Davis (ULCC Digital Archives); and five nominated members of AIM25 reflecting new and existing partners. Input from other leading figures from JISC digital archives projects will be invited. The project will be managed by Geoff Browell with specialist and technical support from Richard Davis and Gareth Knight. Project staff will be ex officio members.
End-user Engagement
The project will establish a project blog to record progress and invite comment. The project team will work proactively with other RDTF activities and projects, including LOCAH and CHALICE, to identify synergistic goals and approaches. We will also work with the Open/Linked Data and Semantic Web communities to ensure the maximum dissemination opportunities for outputs, and for developing the new AIM25 API. Services such as LinkedData.org and PTWS.com will be used to publicise the availability of the data. Project outputs will be made available on the project website. Dissemination to the wider archival, museum and library communities will be offered through the professional conferences and press of ARA, CILIP, RLUK, SCONUL and the Museums Association. Websites such as Culture24 and Museums, Libraries and Archives Council will also be notified. A regional dissemination event will be hosted by the AIM25 partnership in addition to hosted JISC events.


Timeline, workplan and methodology

Work package 1: Project management
This covers management activity throughout the project. It will assemble the project team; prepare the detailed project plan; establish the steering group; and agree the configuration of the project testbed. Cross-institutional, cross-partnership involvement will require close liaison between all partners, including existing AIM25 partners. There will be monthly meetings, ad hoc communication, and at least four focus groups, two from each of the user and archival communities, to undertake the evaluation. Deliverables: Detailed project plan; progress and risk assessment reports; project and focus group meetings; exit and sustainability plan; ongoing coordination; liaison with JISC programme manager. Led by King’s College London Archives.

Work package 2: Testbed record selection and creation
Import into the existing AIM25 system of 1140 new ISAD(G)-compliant collection (fonds) descriptions directly from proprietary software, CALM or AdLib as appropriate, through an established automated ingest protocol developed in association with the Archives Hub. The entries will cover the full archival holdings of the National Maritime Museum (700 collections), and the most significant records of the Zoological Society of Great Britain (100) and the British Postal Museum (100). Those for the London Boroughs of Hammersmith and Fulham (100 collections representing 8% of collection level descriptions) and Wandsworth (100 collections representing 80% of their collection level descriptions) are de facto the collections regarded by custodians as of significant wider interest and those which have been prioritised for cataloguing (the Borough percentages do not reflect physical extent). An additional 40 new descriptions will be added by King’s College London representing accessions for 2009/10, 2.5% of the full total for King’s already available on AIM25. Name authority and subject terms will be added for these entries in the normal way by experienced externally contracted staff. Collections are defined according to their provenance and range from one to a thousand boxes. Deliverables: Creation and configuration of collection descriptions for testbed content. Led by King’s College London Archives.
Work package 3: Metadata profiling and processing
Analysis of testbed materials to define metadata requirements. This will include a review of relevant and recent outputs in the field, such as LOCAH, CHALICE. To drive out the rich seams of information in the narrative texts of the ISAD(G) descriptions (including personal and corporate names, place names and dates) the project will use GATE (General Architecture for Text Engineering) – a Java-based natural language processor developed by the University of Sheffield  - to parse unstructured content and identify key entities. The outputs of GATE processing will be evaluated in conjunction with existing authority records in AIM25. A URI scheme will be defined to enable the resulting metadata to be published as open data. Entities will be tagged and identified with a URI and marked-up text will be exported to EAD. Deliverables: Creation of an RDF enriched corpus; creation of a URI scheme utilising GATE outputs and existing authority records within AIM25; creation of style sheets to  transform GATE outputs to EAD; definition of requirements for RDF triple store. Led by King’s College London CeRch.

Work package 4: Implementation
Implementation of WP3 recommendations within a copy of the current AIM25 system. This will include an RDF triple store, re-implementation of pre-existing search tools and entry points within the system using SPARQL, and creation of an API enabling external services to use the same retrieval tools. Implementation of a tool to support highlighting of key terms in cataloguing as clickable hotspots, offering content-specific linking and searching to other systems with Web APIs likely to be of use to researchers/end-users. Convert AIM25 authority records to RDF and publish as open metadata. Define and implement enhanced AIM25 browsing interface, data entry interface and APIs. Deliverables: Working model of RDFa-enhanced AIM25 system (including end-user and data-entry enhancements); tools to create and publish open metadata; published open metadata and exemplar for evaluation. Led by ULCC.

Work package 5: Evaluation
Evaluation of outputs of WP3 and WP4 using statistical assessment, web analytics and structured survey techniques. Conduct of two focus groups with archivists, drawn from new and existing AIM25 partners, and two with academic users from a variety of disciplines to compare existing AIM25 and open metadata AIM25 searches. Deliverables: Definition of evaluation approach; statistical user and community evaluation of approach to open metadata, GATE processing and enhancements. Led by King’s College London Archives Services.

Work package 6: Dissemination
The project will establish a project blog to record progress and invite comment. The project team will work proactively with other RDTF activities and projects, including LOCAH and CHALICE, to identify synergistic goals and approaches. We will also work with the Open/Linked Data and Semantic Web communities to ensure the maximum dissemination opportunities for outputs, and for developing the new AIM25 API. Services such as LinkedData.org and PTWS.com will be used to publicise the availability of the data. Project outputs will be made available on the project website. Dissemination to the wider archival, museum and library communities will be offered through the professional conferences and press of ARA, CILIP, RLUK, SCONUL and the Museums Association. Websites such as Culture24 and Museums, Libraries and Archives Council will also be notified. A regional dissemination event will be hosted by the AIM25 partnership in addition to hosted JISC events.

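2011 | Feb | Mar | Apr | May | Jun | Jul
WP1 | X | X | X | X | X | X
WP2 | X | X | X | - | - | -
WP3 | X | X | X | - | - | -
WP4 | - | - | X | X | X | X
WP5 | - | - | - | - | X | X
WP6 | X | - | - | X | X | -
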
Budget

Directly Incurred: Staff | August 10–July 11 | August 11–July 12 | TOTAL £
Grade 6, 10 days & 9% FTE | £2,340.80 | - | £2,340.80
Grade 6, 27 days & 24.5% FTE | £5,526.90 | - | £5,526.90
Grade 8, point 46, 8 days, 7% FTE | £2,616.00 | - | £2,616.00
Grade 7, point 43, 29 days, 26% FTE | £9,483.00 | - | £9,483.00
Indexer A, Grade 2, 6 months, 35% FTE | £3,545.30 | - | £3,545.30
Indexer B, Grade 2, 6 months, 35% FTE | £3,545.30 | - | £3,545.30
External Contractor | £2,679.42 | - | £2,679.42
Total Directly Incurred Staff (A) | £29,736.72 | - | £29,736.72

Directly Incurred: Non-Staff | August 10–July 11 | August 11–July 12 | TOTAL £
Travel and expenses | £800 | - | £800
Hardware/software | £1,000 | - | £1,000
Dissemination | £800 | - | £800
Evaluation | £400 | - | £400
Other | £1,000 | - | £1,000
Total Directly Incurred Non-Staff (B) | £4,000 | - | £4,000

Directly Incurred Total (C) (A+B=C) | £33,736.72 | - | £33,736.72

Directly Allocated | August 10–July 11 | August 11–July 12 | TOTAL £
Staff Grade 7, point 38, 6 months, 20% FTE | £5,317.33 | - | £5,317.33
Estates | £4,956.00 | - | £4,956.00
Other | - | - | -
Directly Allocated Total (D) | £10,273.33 | - | £10,273.33

Indirect Costs (E) | £30,734.68 | - | £30,734.68

Total Project Cost (C+D+E) | £74,744.73 | - | £74,744.73
Amount Requested from JISC | £40,000 | - | £40,000
Institutional Contributions | £34,744.73 | - | £34,744.73

Percentage Contributions over the life of the project
JISC | Partners | Total
54% | 46% | 100%
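
For anyone wanting to check the arithmetic, the totals and percentage split can be reproduced from the line items with a short Python sketch. The variable names are mine, and rounding the percentages to whole numbers is an assumption about how the 54%/46% figures were derived.

```python
from decimal import Decimal, ROUND_HALF_UP

# Line items copied from the budget (first year column only).
staff = [Decimal(x) for x in (
    "2340.80", "5526.90", "2616.00", "9483.00", "3545.30", "3545.30", "2679.42")]
non_staff = [Decimal(x) for x in ("800", "1000", "800", "400", "1000")]
directly_allocated = [Decimal("5317.33"), Decimal("4956.00")]
indirect = Decimal("30734.68")       # E
jisc_request = Decimal("40000")

total_a = sum(staff)                 # A: directly incurred staff
total_b = sum(non_staff)             # B: directly incurred non-staff
total_c = total_a + total_b          # C = A + B
total_d = sum(directly_allocated)    # D: directly allocated
project_total = total_c + total_d + indirect   # C + D + E
institutional = project_total - jisc_request


def percent(part, whole):
    """Share of whole as a percentage, rounded to the nearest whole number."""
    return (part / whole * 100).quantize(Decimal("1"), rounding=ROUND_HALF_UP)


jisc_pct = percent(jisc_request, project_total)
partner_pct = percent(institutional, project_total)
print(total_a, total_c, total_d, project_total, institutional)
print(jisc_pct, partner_pct)
```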