Behemoth and Text Analysis

behemoth_demo.js_.txt (118 bytes)
behemoth_demo.swf (1.75 MB)

We were recently asked to process a large number of Word documents and extract information from them. This article describes some of the tools we are using.


The core system we use for this sort of thing is GATE, the General Architecture for Text Engineering. GATE is a Java-based framework for people (mostly academics!) to build text processing pipelines. It draws on a number of other components through an extensive plugin mechanism. These plugins include:

* Tika, for parsing different file formats
* Gazetteer, for looking terms up in predefined lists
* JAPE, a language for making logical decisions based on document annotations

and many more.
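
To make the gazetteer and rule ideas concrete, here is a minimal sketch in plain Python. It does not use GATE's API or JAPE syntax; the `Annotation` type, the `GAZETTEER` lists and the "title followed by a capitalised word" rule are all made up for illustration. It only shows the general pattern: a lookup step attaches typed annotations to character offsets, and a later rule makes a decision based on those annotations.

```python
import re
from typing import NamedTuple


class Annotation(NamedTuple):
    start: int  # character offset where the span begins
    end: int    # character offset just past the span
    type: str   # annotation type, e.g. "City"
    text: str   # the covered text


# Hypothetical gazetteer: each list name maps to the terms it contains.
GAZETTEER = {
    "City": ["Bristol", "London"],
    "Title": ["Dr", "Prof"],
}


def gazetteer_lookup(text):
    """Annotate every whole-word occurrence of a gazetteer term."""
    found = []
    for ann_type, terms in GAZETTEER.items():
        for term in terms:
            pattern = r"\b" + re.escape(term) + r"\b"
            for m in re.finditer(pattern, text):
                found.append(Annotation(m.start(), m.end(), ann_type, m.group()))
    return sorted(found)  # ordered by start offset


def title_then_capitalised_word(text, annotations):
    """Illustrative rule: a Title followed by a capitalised word is a Person."""
    people = []
    for ann in annotations:
        if ann.type != "Title":
            continue
        m = re.match(r"\.?\s+([A-Z][a-z]+)", text[ann.end:])
        if m:
            end = ann.end + m.end()
            people.append(Annotation(ann.start, end, "Person", text[ann.start:end]))
    return people


if __name__ == "__main__":
    sentence = "Dr Smith moved from London to Bristol."
    lookups = gazetteer_lookup(sentence)
    for ann in lookups + title_then_capitalised_word(sentence, lookups):
        print(ann)
```

The point of the design is that later steps never re-parse the raw text from scratch; they work from the annotations that earlier steps left behind.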

GATE is complicated but powerful, and it looks like an interesting opportunity for industry to catch up with academia.

Unfortunately, GATE was designed to run on a single machine, and only recently have a couple of distributed versions become available. GATE Cloud is the home-grown project, but the one I am most familiar with is Behemoth, created by a small company in Bristol. Behemoth's premise is to make GATE work in a Hadoop environment.


Hadoop is a whole ecosystem of tools for distributed "Big Data" computing. It is primarily disk-based rather than memory-based, so it is typically used for problems that involve processing many terabytes at once. If all your data fits in main memory (RAM), Hadoop might not be the right solution.

Core Hadoop involves two main services:

* HDFS, a distributed, fault-tolerant, block-based, write-once file system
* MapReduce, a distributed, fault-tolerant system for spreading computation over lots of nodes. MapReduce jobs can be written in Java, in other JVM-based languages, or even in non-JVM languages through streaming.
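
The MapReduce model itself is easier to see in code than in prose. The sketch below is a single-process simulation in Python, not Hadoop code: the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are our own, and word counting is just the customary toy problem. On a real cluster the framework runs many mappers and reducers on different nodes and performs the grouping step for you.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Mapper: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: combine all the values seen for one key."""
    return key, sum(values)


def run_job(documents):
    """Run map, shuffle and reduce in sequence over a list of documents."""
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


if __name__ == "__main__":
    print(run_job(["big data", "big text data", "text"]))
```

Because each call to `map_phase` looks at one document only, and each call to `reduce_phase` looks at one key only, the work can be spread over as many machines as are available.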


Behemoth is Java code for getting documents into HDFS in a format that GATE understands (and out again), plus a way of launching GATE within the MapReduce system.
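
The general shape of that arrangement can be sketched as follows. This is an illustration of the idea only and not Behemoth's actual classes or file format: the tab-separated JSON record layout, the function names and the toy `annotate` step (standing in for a GATE pipeline) are all assumptions. Each document becomes one key/value record, and a map step with no reducer annotates every record independently.

```python
import json
import re


def to_record(doc_id, text):
    """Serialise one document as a single line: id, tab, JSON payload."""
    payload = {"text": text, "annotations": []}
    return f"{doc_id}\t{json.dumps(payload)}"


def annotate(text):
    """Toy stand-in for a text pipeline: mark capitalised words."""
    return [
        {"type": "Capitalised", "start": m.start(), "end": m.end()}
        for m in re.finditer(r"\b[A-Z][a-z]+\b", text)
    ]


def map_record(line):
    """Map step: read one record, annotate its text, write the record back."""
    doc_id, raw = line.split("\t", 1)
    payload = json.loads(raw)
    payload["annotations"] = annotate(payload["text"])
    return f"{doc_id}\t{json.dumps(payload)}"


if __name__ == "__main__":
    # Hypothetical corpus: document ids mapped to already-extracted text.
    corpus = {"doc1.doc": "Behemoth runs on Hadoop"}
    for doc_id, text in corpus.items():
        print(map_record(to_record(doc_id, text)))
```

Keeping the text and its annotations together in one record means the output of one job can be fed straight into the next.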

(If the animation below does not appear then please select the file itself