Monday 19 September 2011

Next steps

Previous posts have highlighted some of the benefits of the project.

Key findings include:

  • An approach to adding supplementary semantic metadata that builds on the work practices of archivists rather than imposing a new layer of work
  • Semantic analysis can supplement existing indexing, making its creation faster, more accurate and more consistent
  • Existing natural language processing services such as OpenCalais still lack vocabularies for the library, museum and archive domain, so have some way to go before they 'learn' the language of libraries, museums, archives, galleries and the research environment
  • Opportunities for enhancing discoverability with roll-over features and drop-downs, making Linked Data of mainstream value to users rather than the preserve of a minority of technical experts
Next steps
  • Further refinement of the editing interface to make mark-up clearer and with features such as bulk analysis of ISAD(G) records
  • Further testing of sample data across key natural language processing services to ask certain questions: what are they missing? How accurate are they? How could they be improved, and what role would JISC have in this?
  • More testing of front end delivery pages in AIM25 using external data, particularly geographical and name authority data via projects such as Linking Lives
  • Moving beyond AIM25 to link archive descriptions with other services via an API. Linked Data is only useful if users can connect and aggregate, mix and mash
  • Using services to mix archival, bibliographic and museums content information, and to cross-walk with broader content used by wider audiences (such as genealogical data, metadata for digitised collections - eg newspapers, mapping or crowdsourced data) 
  • Surveying which services need to be created from existing database records that can then be linked together. This should be user- and service-driven, with input from major national institutions such as TNA and the BL, as well as JISC

Monday 5 September 2011

Some assessment about the scalability and robustness of the processing against OpenCalais

The title of this post is a question from our evaluator David Kay that I somewhat failed to answer in my last post, so here goes.

Scalability of OpenCalais use (some numbers)

We are using the OpenCalais web service (http://www.opencalais.com/documentation/calais-web-service-api) on a free basis, which allows for 50,000 transactions per day per user. Anyone using the OMP workflow prototype will be regarded as the same user. When running the analysis in the workflow, each ISAD(G) element that is selected for analysis represents a single OpenCalais transaction.

There are currently 23 text boxes that can be selected for analysis. However, during focus group meetings with the AIM25 archivists it was suggested that only 2 or 3 of these would be likely to yield meaningful analysis.

In the last 9 or so years AIM25 has accrued records for 15,335 collections. So some back-of-envelope maths tells us:

Rate of record addition: 15,335 records / 9 years ≈ 4.7 records/day

Even in the unlikely event that someone analysed all 23 text boxes, that would be just over 100 transactions per day. So there is some wriggle room in the 50,000 limit for edits, re-analysis, etc.

Of course the reality of how archivists use AIM25 is probably not very well represented by those numbers. The hard ceiling that OpenCalais analysis places on archivist throughput (i.e. the maximum number of records that could be fully analysed in a day) would be about 2,174, i.e. 50,000 / 23. I'll let others who are more qualified comment on whether there has ever been, or may be in the future, a throughput that exceeds that.
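The back-of-envelope arithmetic above can be sketched as a quick check, using only the figures quoted in this post:

```python
# Quick check of the quota arithmetic, using the figures from this post.
DAILY_LIMIT = 50_000   # OpenCalais free-tier transactions per day
ELEMENTS = 23          # ISAD(G) text boxes that could be analysed per record
RECORDS = 15_335       # collection records accrued by AIM25
YEARS = 9

records_per_day = RECORDS / (YEARS * 365)      # rate of record addition
worst_case_tx = records_per_day * ELEMENTS     # if every element were analysed
ceiling = DAILY_LIMIT // ELEMENTS              # whole records analysable per day

print(f"{records_per_day:.1f} records/day")     # ~4.7
print(f"{worst_case_tx:.0f} transactions/day")  # ~107
print(f"{ceiling} records/day ceiling")         # 2173 (just under 2,174)
```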

Robustness of processing against OpenCalais (some churn)

The analysis does take some time and, as one would expect, the time taken increases with both the number of elements being analysed and the length of the text blocks sent for analysis. The prototype leaves the annotation and processing of the result RDF to the JavaScript element of the AJAX process. This means that there is a reliance on the performance of the client machine, which is an unknown.

For the largest block of text in the current system (56,544 characters), the browser-side processing became mired in "Unresponsive script" messages. The request to OpenCalais was not the thing causing the lag, so the culprit was the browser-side processing of the result. All this suggests that more of this post-response processing should be pushed over to the server.

A move to more server-side processing would also improve extensibility of the framework. Server-side brokerage of results from a range of services would allow for a more consistent response both for AIM25 workflow and for any potential third party clients.
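As a sketch of that brokerage idea — purely illustrative, since the real OpenCalais and AIM25 response shapes are not reproduced here and the payload field names below are invented — the server could map each service's results into one common term list before anything reaches a client:

```python
# Illustrative only: normalise results from different analysis services
# into one common shape before returning them to any client.
# The payload field names below are invented, not the real APIs.

def normalise(service, payload):
    """Map a service-specific result to a common list of terms."""
    terms = []
    if service == "opencalais":
        # assume: {entity_uri: {"name": ..., "type": ...}, ...}
        for uri, entity in payload.items():
            terms.append({"label": entity["name"],
                          "type": entity["type"].lower(),
                          "uri": uri,
                          "source": "opencalais"})
    elif service == "aim25":
        # assume: [{"term": ..., "term_type": ...}, ...]
        for row in payload:
            terms.append({"label": row["term"],
                          "type": row["term_type"],
                          "uri": None,
                          "source": "aim25"})
    return terms

# A client (the workflow form or a third party) then sees one format:
merged = (normalise("opencalais",
                    {"http://example.org/e1": {"name": "London",
                                               "type": "Place"}}) +
          normalise("aim25", [{"term": "Weaving",
                               "term_type": "concept"}]))
print([t["label"] for t in merged])   # ['London', 'Weaving']
```

The point of the design is that adding a third service would only mean adding one more branch server-side, with no change to any consumer.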

Friday 2 September 2011

The OMP prototype ISAD(G) editor - How it works

This post describes the prototype workflow tool: how it can be used for recording AIM25 collection-level records, and some associated technical issues.

General aims of prototype

  1. To improve on the usability of the existing offering
  2. To make use of semantic annotation within the tool
  3. To use linked data to enhance the user experience of the AIM25 website

Detailed aims

Aim 1.

  • a) Eliminate the need for archivists to use mark-up
  • b) Integrate the indexing process with the metadata recording process
  • c) Reduce page scrolling and generally improve usability

Aim 2.

  • a) Analyse the textual input against existing authoritative sources both external and internal
  • b) Suggest and record indexing terms derived from the analysis
  • c) Record the semantic properties of terms

Aim 3

  • a) Mark up the semantic properties of indexed terms both within the ISAD(G) display and within the “Access points” lists
  • b) Provide links to related services based on the semantic properties of the terms
Technological details

Aim1

For the prototype we took a snapshot of the AIM25 database and put this on the OMP test server (http://data.aim25.ac.uk).

An alternative workflow for adding and editing collection-level records was built in PHP, using this snapshot as a data source. The JavaScript framework jQuery was also used to control on-screen actions and provide a bit of dynamism. The ISAD(G) elements are grouped into areas as described by documents like this one (http://is.gd/GEXdG2), with each area given a tab.

[Screenshots: the existing single-page editing form, which becomes the tabbed prototype interface]
The access points are displayed on the right. Terms are colour coded according to the four term types:
  • person
  • place
  • organisation
  • concept
Each term classified under these types is represented by the OMP data-browser, which parses RDF that is in turn derived from the existing database (more on that later). Archivists can remove access points by dragging and dropping terms into the dustbin icon.

Aim2

The prototype workflow uses one external service (OpenCalais) directly to analyse text and suggest useful terms for indexing. AIM25's existing index is also interrogated; this dataset includes the UK Archival Thesaurus. To support this, an AIM25 text analysis service was developed.

For rapid development this service runs boring old SQL on the existing AIM25 data tables, but as there is already a mechanism to transpose this data into RDF (and more), a more robust semantic solution is theoretically a short hop away.
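A minimal sketch of that kind of plain-SQL lookup — here using an in-memory SQLite table with hypothetical column names standing in for the real AIM25 tables:

```python
import sqlite3

# Illustrative stand-in for the AIM25 lookup service: a tiny index table
# and a simple exact-match query. Table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE index_terms (term TEXT, term_type TEXT)")
conn.executemany("INSERT INTO index_terms VALUES (?, ?)",
                 [("Weaving", "concept"),
                  ("London", "place"),
                  ("Royal Society", "organisation")])

def suggest_terms(text):
    """Return index terms that appear verbatim in the submitted text."""
    cur = conn.execute("SELECT term, term_type FROM index_terms")
    return [(t, ty) for t, ty in cur.fetchall() if t.lower() in text.lower()]

print(suggest_terms("Records of a weaving guild based in London"))
# → [('Weaving', 'concept'), ('London', 'place')]
```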

Archivists can use check-boxes by the side of each textarea in the workflow form to select ISAD(G) elements for analysis. The selected text is sent for analysis by one or both of the services and results are displayed in two ways.
  • As embedded mark-up in the "textarea"
  • As term lists in the Analysis/Indexing area

Above is an example of a list of terms returned from the AIM25 service. The term "Weaving" is in the process of being added as an access point for this record.


Here we see the same results embedded in the text. These are a smaller set as they only include the exact matches. When saved, terms not added to the access points are stripped out. Those that remain can be represented in context as RDFa.
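For illustration, representing a retained term as RDFa on redisplay might look like the sketch below — the `class`-based colour coding mirrors the four term types described earlier, but the attribute choices and URI pattern are assumptions for this example, not the project's actual output:

```python
# Hypothetical sketch: wrap a retained access-point term in an RDFa span
# when the record is redisplayed. Attributes and URI pattern are assumed.
def rdfa_span(term, term_type, uri):
    return (f'<span class="{term_type}" about="{uri}" '
            f'property="rdfs:label">{term}</span>')

print(rdfa_span("Weaving", "concept",
                "http://data.aim25.ac.uk/term/weaving"))
```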


Here the results returned by OpenCalais are embedded and below they are displayed as a list so that they can be added to the access points. Also below are the results of a direct lookup on the AIM25 service so that archivists can add access points for terms that do not appear in the text.



Did we achieve any of Aim 3? More to follow on this soon...