GATE Learning Framework Plugin

[NOTE: this documentation is still a work in progress!]

The Learning Framework is GATE's most recent machine learning plugin. It is still under active development and in some flux, but stable enough to use. It offers a wider variety of more up-to-date ML algorithms than earlier machine learning plugins, currently supporting various Mallet classification algorithms, Mallet's CRF implementation, and LibSVM. In addition, Weka classification and regression algorithms can be used by running Weka externally with the weka-wrapper tool (see Using Weka), SciKit-Learn algorithms by running SciKit-Learn externally with the sklearn-wrapper tool (see Using SciKit Learn), and Keras by running it externally with the keras-wrapper tool (see Using Keras).

It offers broadly the same functionality as the Batch Learning PR, with some differences: in addition to providing a broader range of algorithms, it is likely to be faster to train and apply under most circumstances, it can export training data to sparse ARFF format (a minimal example of the format is shown below), and its interface design differs slightly, offering more settings in the form of runtime parameters and supporting multiple trained models in a more user-friendly way.
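For readers unfamiliar with it, the sparse ARFF format lists only the non-zero values of each instance as index/value pairs, which keeps exports of high-dimensional one-of-k features compact. Here is a minimal hand-written example; the relation and attribute names are invented for illustration and are not necessarily what the export PR produces:

```
@relation example

@attribute token_the numeric
@attribute token_cat numeric
@attribute target {pos,neg}

@data
{0 1, 2 pos}
{1 2, 2 neg}
```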

The Learning Framework supports three task modes:

  • Classification, which simply assigns a class to each instance annotation. For example, each sentence might be classified as having positive or negative sentiment, each word may get assigned a part-of-speech tag, or a document may be classified as being relevant to some topic or not. With classification, the parts of text are known in advance and assigned one out of several possible class labels.
  • Sequence tagging, also called Chunking, which finds mentions, such as locations or persons, within the text, i.e. the relevant parts of the text are not known in advance; the task is to find them.
  • Regression, which assigns a numerical target, and might be used to rank disambiguation candidates, for example. This is similar to classification in that the relevant parts of text (sentences, words, ...) are known in advance, but instead of a nominal class label, a numeric value is assigned to those parts.

These task modes are provided as separate processing resources (PRs), with distinct PRs for training and for application, plus evaluation PRs for classification and regression. The plugin also includes an export PR, allowing GATE to be used to prepare feature files from textual data that can then be used outside of GATE. A sketch of how such PRs can be created from GATE Embedded follows below.
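The plugin is normally used from the GATE Developer GUI, but the PRs can also be created programmatically via GATE Embedded. The following is a minimal sketch, not the plugin's documented usage: the plugin path is hypothetical, and the PR class name LF_TrainClassification is an assumption based on the plugin's naming scheme, so check the plugin's creole.xml for the exact resource names.

```java
import java.io.File;

import gate.Factory;
import gate.FeatureMap;
import gate.Gate;
import gate.ProcessingResource;

public class LFTrainingSketch {
  public static void main(String[] args) throws Exception {
    // Initialise GATE Embedded (assumes GATE is installed and configured).
    Gate.init();

    // Load the LearningFramework plugin from its install directory.
    // The path below is hypothetical -- point it at your local copy.
    Gate.getCreoleRegister().registerDirectories(
        new File("/path/to/plugins/LearningFramework").toURI().toURL());

    // Create the classification training PR. The class name is an
    // assumption; the actual name is listed in the plugin's creole.xml.
    FeatureMap params = Factory.newFeatureMap();
    ProcessingResource trainPr = (ProcessingResource) Factory.createResource(
        "gate.plugin.learningframework.LF_TrainClassification", params);

    // Instance annotation type, target feature, algorithm, and data
    // directory are then set as runtime parameters before running the
    // PR over a corpus in a corpus pipeline.
  }
}
```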

Get started here!

Feature Overview

  • Supports classification, regression, sequence tagging
  • Supports learning algorithms from: LibSVM, Mallet, Weka (using a wrapper software) and Scikit-Learn (using a wrapper software)
  • Supports various ways of handling missing values
  • Supports coding of nominal values as one-of-k or as "value number"
  • Supports instance weights
  • Supports per-instance classification cost vectors in place of a class label, for per-instance cost-aware classification algorithms (however, no such algorithm is working yet)
  • Supports limiting attribute lists to only those annotations which are within another containing annotation
  • Supports using pre-calculated scores for one-of-k coded nominal values, e.g. pre-calculated TF*IDF scores for terms or ngrams (for n-grams with n>1 the final score is calculated as the product of the individual pre-calculated gram scores)
  • Supports multi-valued annotation features for one-of-k coded nominal attributes: for example, if the annotation feature is a List, a dimension / feature is created for each element in the list
  • Supports multi-valued annotation features for numeric attributes: in this case the elements (which must be doubles or convertible to doubles) are "spliced" into the final feature vector, e.g. for making use of pre-calculated word embeddings (see the sketch after this list)
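To make the last two items concrete, here is a small self-contained sketch (plain Java, not the plugin's actual code) of how a one-of-k coded nominal value and a spliced multi-valued numeric feature combine into a single feature vector; all names and values are invented:

```java
import java.util.Arrays;
import java.util.List;

public class FeatureCodingSketch {
  public static void main(String[] args) {
    // One-of-k: each distinct nominal value gets its own dimension.
    List<String> alphabet = Arrays.asList("NN", "VB", "JJ");
    String pos = "VB";
    double[] oneOfK = new double[alphabet.size()];
    oneOfK[alphabet.indexOf(pos)] = 1.0;   // -> [0.0, 1.0, 0.0]

    // Splicing: a list-valued numeric feature (e.g. a pre-calculated
    // word embedding) contributes one dimension per element.
    List<Double> embedding = Arrays.asList(0.12, -0.45, 0.80);
    double[] vector = new double[oneOfK.length + embedding.size()];
    System.arraycopy(oneOfK, 0, vector, 0, oneOfK.length);
    for (int i = 0; i < embedding.size(); i++) {
      vector[oneOfK.length + i] = embedding.get(i);
    }
    System.out.println(Arrays.toString(vector));
    // -> [0.0, 1.0, 0.0, 0.12, -0.45, 0.8]
  }
}
```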

Processing Resources:

Other important documentation pages:
