AnnoMarket - Text Annotation in the cloud

Last week I attended the Text Analytics Meetup in London, where I saw a presentation about the beta launch of AnnoMarket. [Disclaimer: Since AnnoMarket is currently Beta quality code, anything I say might have been improved, or be totally wrong, by the time of the full commercial release.]

AnnoMarket seems to solve one technical problem in the field of Natural Language Processing (NLP): annotating documents. The business problem is “How do I automatically annotate my text so that I can see what is in that text without employing lots of human beings to read it all and tag it up manually?”

This is not exactly new. The most famous similar service is OpenCalais from Thomson Reuters which has been going for some years. OpenCalais is about taking unstructured documents and finding structure and meaning within them. To quote the OpenCalais website “The free OpenCalais service and open API is the fastest way to tag the people, places, facts and events in your content.” Its main selling point is that Thomson Reuters knows about lots of people, companies, and so on - as it is an information company.

OpenCalais has been the yardstick for a number of years. However, in the fast-growing academic world of Text Analytics (aka Natural Language Processing, aka Text Engineering) there has been dissatisfaction with what OpenCalais does.

Several programming libraries are being developed to make NLP easier for programmers. One of the most popular frameworks for this in Java is “GATE”. It contains plugins that each handle one part of the NLP problem, and you chain them together into a pipeline. Systems like GATE were meant to give developers a common framework for writing NLP code in Java, and a way to develop semantic pipelines without worrying too much about writing the Java components themselves. Importantly, GATE also provides a consistent way for people who are not Java developers to run that code.
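To make the "chain plugins into a pipeline" idea concrete, here is a minimal sketch in plain Python. It is not GATE's API: the `Document`, `Annotation`, `tokenise`, `make_gazetteer` and `run_pipeline` names are all made up for illustration. The only point is that each component reads the document, adds annotations, and hands it to the next one.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    kind: str    # e.g. "Token" or "Location"
    start: int   # character offset where the span begins
    end: int     # character offset just past the span


@dataclass
class Document:
    text: str
    annotations: list = field(default_factory=list)


def tokenise(doc):
    # Component 1: mark every run of word characters as a Token.
    for match in re.finditer(r"\w+", doc.text):
        doc.annotations.append(Annotation("Token", match.start(), match.end()))
    return doc


def make_gazetteer(kind, names):
    # Component 2 (factory): tag Tokens whose text appears in a lookup list.
    def lookup(doc):
        for ann in list(doc.annotations):
            if ann.kind == "Token" and doc.text[ann.start:ann.end] in names:
                doc.annotations.append(Annotation(kind, ann.start, ann.end))
        return doc
    return lookup


def run_pipeline(components, doc):
    # Each component sees the annotations left by the ones before it.
    for component in components:
        doc = component(doc)
    return doc


# Hypothetical two-step pipeline: tokenise, then look up place names.
pipeline = [tokenise, make_gazetteer("Location", {"London"})]
result = run_pipeline(pipeline, Document("I attended a Meetup in London"))
locations = [result.text[a.start:a.end]
             for a in result.annotations if a.kind == "Location"]
```

The gazetteer step only works because the tokeniser ran first, which is why the order of components in a pipeline matters.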

AnnoMarket fundamentally runs GATE on your behalf. You feed it documents and run a number of different tools over them to break up the text, identify named entities in the text and do pretty much anything else that can be done with NLP. You can either give it a bulk set of documents to process as a batch, or feed it single documents and expect a near-immediate response.

So looking at the AnnoMarket proposition again, the technical problem is “How do I get a commercially supported version of GATE to work in the cloud as SaaS (Software as a Service), so that I don’t have to run it myself, but can also track how many resources are used so I know it is cheaper than running it all myself?”

I think of GATE a bit like this (largely cribbed from the “2 Minute Guide to GATE”):

-Obtain your documents
-Try to figure out whether the terms in that document are in some kind of structure - an ontology
-Try to figure out how the document ought to be tagged - with the business experts. (This is the Gold Standard)
-Build a semantic pipeline which tries to achieve this.
-Turn this pipeline into a “GATE Application” which can either be embedded in other Java code, or run in GATE Cloud, or if you like Hadoop run through Behemoth.
-Do something smart with the results - such as push them to Solr or Mimir, or an RDF database, or a Graph database, whatever…
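The Gold Standard step is what lets you tell whether a pipeline is any good: you compare what the pipeline tagged against what the business experts tagged. A rough sketch of that arithmetic in plain Python follows. This is not GATE's own evaluation tooling; the `score` function, the strict exact-match rule and the sample annotations are all assumptions made up for illustration.

```python
def score(gold, predicted):
    """Strict precision, recall and F1 over (start, end, type) triples."""
    gold, predicted = set(gold), set(predicted)
    # An annotation only counts as correct if offsets AND type match exactly.
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up example: the experts tagged three spans, the pipeline found two,
# and it gave one of those two the wrong type.
gold = [(0, 6, "Person"), (20, 26, "Location"), (30, 44, "Organization")]
predicted = [(0, 6, "Person"), (20, 26, "Organization")]

precision, recall, f1 = score(gold, predicted)
```

Here only one of the two predicted annotations is right and only one of the three gold annotations was found, so precision is 1/2 and recall is 1/3. A real evaluation would probably also want a lenient mode that gives credit for overlapping spans.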

Personally, I have run GATE applications on a small Hadoop cluster through Behemoth. I am happy that I built myself a scalable system - BUT that required my Big Data experience and spare hardware.

When we do this ourselves I call upon my colleague Meg to build a GATE pipeline which does the NLP we want. She investigates what components are available, ties them together with JAPE rules, and sometimes builds more components in Java.

All of this requires significant technical skills which are not always available in big publishing firms. After all, if your main business is textual information, why would you be an expert in running clusters or writing Java?

Because AnnoMarket is essentially a commercial REST service, you can access it from a number of different programming languages - and several libraries are provided to make that easier. If you want to learn what the available pipelines do, you can feed small documents into a web interface and see the tags generated.
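To give a feel for what "access it as a REST service" means in practice, here is a sketch using only the Python standard library. Everything specific in it is an assumption: the base URL, the `/pipelines/<id>/annotate` path, the use of HTTP Basic authentication and the JSON response type are all hypothetical, not AnnoMarket's documented API. The sketch only builds the request and does not send it.

```python
import base64
import urllib.request


def build_annotation_request(base_url, pipeline_id, api_key, api_secret, text):
    # Hypothetical URL layout: one endpoint per hosted pipeline.
    url = f"{base_url}/pipelines/{pipeline_id}/annotate"
    # Assumed authentication scheme: HTTP Basic with a key/secret pair.
    raw = f"{api_key}:{api_secret}".encode("utf-8")
    credentials = base64.b64encode(raw).decode("ascii")
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),   # the document to annotate is the body
        method="POST",
        headers={
            "Authorization": f"Basic {credentials}",
            "Content-Type": "text/plain; charset=utf-8",
            "Accept": "application/json",
        },
    )


# Placeholder values throughout; api.example.com is not a real service.
request = build_annotation_request(
    "https://api.example.com",
    "named-entities",
    "my-key",
    "my-secret",
    "Last week I attended the Text Analytics Meetup in London.",
)
```

Passing `request` to `urllib.request.urlopen` would send it. The single-document case looks like this one call per document; a batch job would more likely involve uploading the whole set and collecting the results later.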

However, I can’t decide whether being built on GATE is AnnoMarket’s strength or weakness. It can do pretty much anything GATE can do - but that requires you to be fairly familiar with GATE before you start. Time will tell whether this works. You can of course employ one of a number of GATE engineers (*cough* such as us) or learn about it yourself.

However, if you need to build your own plugins then you have to go down the untested path of submitting them to the AnnoMarket system (by talking to them) and having them made available to anyone. We shall see how this works in the long term.

So what is my verdict? I am worried about its long-term business future. I fear that the problem is one which either you know how to cope with - or you probably don’t need solved. However, for the small NLP company it might be great. We want to supply excellent NLP applications but don’t want the hassle of running our own servers to run them. So it may still work.

There is a lot more to say, but let’s get this out there and add more in another article later on.

References
-GATE Main Site
-The 2 Minute Guide to GATE