Monday 5 September 2011

Some assessment about the scalability and robustness of the processing against OpenCalais

The title of this post is a question from our evaluator, David Kay, that I somewhat failed to answer in my last post, so here goes.

Scalability of OpenCalais use (some numbers)

We are using the OpenCalais web service (http://www.opencalais.com/documentation/calais-web-service-api) on a free basis. OpenCalais allows 50,000 transactions per day per user on this basis. Anyone using the OMP workflow prototype will be regarded as the same user. When running the analysis in the workflow, each ISAD(G) element that is selected for analysis represents a single OpenCalais transaction.

There are currently 23 text boxes that can be selected for analysis. However, during focus group meetings with the AIM25 archivists it was suggested that only 2 or 3 of these would be likely to yield meaningful analysis.

In the last 9 or so years AIM25 has accrued records for 15,335 collections. So some back of envelope maths would tell us:

Rate of record addition: 15,335 records / 9 years is roughly 1,700 records/year, or roughly 4.7 records/day

Even in the unlikely event that someone analysed all 23 text boxes for every record, that would be just over 100 transactions per day. So there is plenty of wriggle room in the 50,000 limit for edits, re-analysis, etc.

Of course, the reality of how archivists use AIM25 is probably not very well represented by those numbers. The potential constraint placed on archivist throughput by OpenCalais analysis (i.e. the maximum number of records that could be analysed in a day, if all 23 text boxes were analysed for each) would be about 2,174, which is 50,000 / 23. I'll leave it to others who are more qualified to comment on whether there has ever been, or may be in the future, a throughput that exceeds that.
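For anyone who wants to rerun the sums with different assumptions, here is a rough sketch in Python. The figures are the ones quoted in this post; the function names are mine and purely illustrative, and "9 or so years" is treated as exactly 9, so the results are approximate.

```python
# Figures quoted in the post.
DAILY_LIMIT = 50_000   # free OpenCalais transactions per day per user
TEXT_BOXES = 23        # ISAD(G) text boxes that can be selected for analysis
COLLECTIONS = 15_335   # collection records accrued by AIM25
YEARS = 9              # "9 or so years": an assumption, so results are approximate


def records_per_day(collections: int, years: float) -> float:
    """Average number of records added per day over the period."""
    return collections / (years * 365)


def transactions_per_day(rate: float, boxes_per_record: int) -> float:
    """Each analysed text box counts as one OpenCalais transaction."""
    return rate * boxes_per_record


def max_records_per_day(daily_limit: int, boxes_per_record: int) -> int:
    """Whole records that fit within the daily limit (rounds down)."""
    return daily_limit // boxes_per_record


if __name__ == "__main__":
    rate = records_per_day(COLLECTIONS, YEARS)
    print(f"records/day: {rate:.2f}")
    # Compare the likely case (about 3 boxes) with the worst case (all 23).
    for boxes in (3, TEXT_BOXES):
        load = transactions_per_day(rate, boxes)
        ceiling = max_records_per_day(DAILY_LIMIT, boxes)
        print(f"{boxes} boxes: {load:.0f} transactions/day, "
              f"ceiling {ceiling} records/day")
```

Note that the ceiling function rounds down to whole records, so it gives 2,173 for 23 boxes where the unrounded division is 2,173.9. If only 3 boxes per record are analysed, the ceiling rises to well over 16,000 records a day.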

Robustness of processing against OpenCalais (some churn)

The analysis does take some time and, as one would expect, the time taken increases with the number of elements being analysed and the length of the text blocks sent for analysis. The prototype leaves the annotation and processing of the result RDF to the JavaScript element of the AJAX process. This means that there is a reliance on the performance of the client machine, which is an unknown.

For the largest block of text in the current system (56,544 characters) the browser-side processing did become mired in "Unresponsive script" messages. The request to OpenCalais was not the thing causing the lag, so the culprit was the browser-side processing of the result. All this would suggest that more of this post-response processing should be pushed over to the server.

A move to more server-side processing would also improve extensibility of the framework. Server-side brokerage of results from a range of services would allow for a more consistent response both for AIM25 workflow and for any potential third party clients.
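To make the brokerage idea a little more concrete, here is a minimal sketch in Python of what a server-side broker might look like. Everything in it is hypothetical: the `Annotation` fields, the adapter, and the shape of the raw result are my own inventions for illustration. In particular, the adapter assumes the service result has already been parsed into a simple dictionary, which is not the RDF that OpenCalais actually returns, and none of this is what the prototype currently does.

```python
from dataclasses import asdict, dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Annotation:
    """One entity in the common format the broker returns (hypothetical fields)."""
    service: str      # which analysis service produced it
    element: str      # which ISAD(G) element the text came from
    entity_type: str  # e.g. "Person", "Place"
    text: str         # the matched text


# An adapter turns one service's already-parsed result into Annotations.
Adapter = Callable[[str, dict], List[Annotation]]


def calais_like_adapter(element: str, raw: dict) -> List[Annotation]:
    """Hypothetical adapter. Assumes raw looks like
    {"entities": [{"type": ..., "name": ...}]}, NOT the real OpenCalais format."""
    return [
        Annotation("calais_like", element, e["type"], e["name"])
        for e in raw.get("entities", [])
    ]


def broker(element: str, raw_results: Dict[str, dict],
           adapters: Dict[str, Adapter]) -> List[dict]:
    """Normalise every service's result for one element into one list of dicts."""
    merged: List[dict] = []
    for service, raw in raw_results.items():
        adapter = adapters.get(service)
        if adapter is None:
            continue  # no adapter registered: skip rather than fail the request
        merged.extend(asdict(a) for a in adapter(element, raw))
    return merged
```

The point of the design is that the workflow page, or any third party client, would only ever see the one common format, and supporting a new service would mean writing one more adapter on the server rather than more browser-side script.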
