Learning multilingual named entity recognition from Wikipedia

Using these resources

These resources are subject to a CC-BY 3.0 license.

Please cite our Artificial Intelligence Journal paper in any published works using these resources, except where noted below:

@Article{nothman2012:artint:wikiner,
  author = {Joel Nothman and Nicky Ringland and Will Radford and Tara Murphy and James R. Curran},
  title = {Learning multilingual named entity recognition from {Wikipedia}},
  journal = {Artificial Intelligence},
  publisher = {Elsevier},
  note = {(in press)},
  year = {2012},
  doi = {10.1016/j.artint.2012.03.006},
  url = {http://dx.doi.org/10.1016/j.artint.2012.03.006}
}

Page classification type scheme

Our annotation process used a dynamic schema of fine-grained hierarchical types, allowing us to map these to coarser granularities for experimentation and application.

The scheme file contains four tab-separated columns:
  • Type - a hierarchical Named Entity type used to label Wikipedia pages
  • CoNLL - a coarse-grained representation of the type based around CoNLL NE types (referred to as Coarse in our Artificial Intelligence paper)
  • Medium - a coarse-grained representation of the type (referred to as Fine in our Artificial Intelligence paper)
  • Fine - a coarse-grained representation of the type.
The scheme also contains two non-standard Types, both marked _ignore in the CoNLL, Medium and Fine columns:
  • _Error - denotes pages that should not be classified or used for building NE models, as a result of errors in our annotation software.
  • _Deleted - denotes pages that were present when sampling articles for annotation, but not when performing annotation.
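
As an illustration of mapping the fine-grained types to a coarser granularity, here is a minimal Python sketch. It assumes the scheme file has no header row and exactly four tab-separated fields per line in the order listed above; the function name load_scheme and the type PER:Example are hypothetical and are not taken from the released file.

```python
def load_scheme(lines, granularity="CoNLL"):
    """Map each Type to its label at the chosen granularity.

    Assumes four tab-separated fields per line (Type, CoNLL, Medium, Fine)
    and no header row. Types marked _ignore are left out of the mapping.
    """
    columns = ("Type", "CoNLL", "Medium", "Fine")
    if granularity not in columns[1:]:
        raise ValueError(f"unknown granularity: {granularity!r}")
    index = columns.index(granularity)
    mapping = {}
    for line in lines:
        row = line.rstrip("\n").split("\t")
        if len(row) != 4:
            continue  # skip blank or malformed lines
        if row[index] == "_ignore":
            continue  # _Error and _Deleted carry no usable label
        mapping[row[0]] = row[index]
    return mapping


# Hypothetical rows for illustration; only _Error and _Deleted are
# names documented above.
sample = [
    "PER:Example\tPER\tPER\tPER:Example\n",
    "_Error\t_ignore\t_ignore\t_ignore\n",
    "_Deleted\t_ignore\t_ignore\t_ignore\n",
]
print(load_scheme(sample, "CoNLL"))
```

Dropping the _ignore rows at load time means later lookups simply fail for pages that should not be used, which may or may not suit a given experiment.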

Gold-standard page classifications

There are two sets of gold-standard Wikipedia page classifications sampled as described in the Artificial Intelligence paper:
  • Popular: 2k English Wikipedia pages classified using the type scheme above. (Introduced in Tardif et al. 2009.)
  • Random: 4k Wikipedia pages from 9 languages with the distribution below.
Sample   Language  Pages
POPULAR  EN        2322
RANDOM   EN        2531
RANDOM   DE        872
RANDOM   ES        203
RANDOM   FR        210
RANDOM   IT        203
RANDOM   NL        286
RANDOM   PL        210
RANDOM   PT        202
RANDOM   RU        223
Each file contains three tab-separated columns, with each line representing a classified page:
  • Language - the language of the Wikipedia the page is drawn from.
  • Title - the title of the page, utf-8 encoded.
  • Type - the type of the classification, drawn from the Type column in the scheme.
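
A file in this three-column format could be read along the following lines. This is a sketch only: it assumes no header row, and the names read_gold and Page, along with the sample titles and types, are hypothetical.

```python
from collections import Counter, namedtuple

# One classified page: the three columns described above.
Page = namedtuple("Page", ["language", "title", "type"])


def read_gold(lines):
    """Yield a Page for every line with exactly three tab-separated fields."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) == 3:
            yield Page(*fields)


# Hypothetical rows for illustration only.
sample = [
    "EN\tExample title\tPER:Example\n",
    "DE\tBeispieltitel\tLOC:Example\n",
    "EN\tAnother title\t_Deleted\n",
]
pages = list(read_gold(sample))
per_language = Counter(page.language for page in pages)
print(per_language)
```

When reading a real file, open it with encoding="utf-8" so that non-ASCII titles are decoded correctly.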

An earlier gold-standard labelling, first described in Nothman et al. (2008), was performed by a single annotator using a different schema. It sampled 1100 articles at random and a further 200 from articles with a high number of incoming links (these are the last 200 entries in the file). The line format is PAGE_TITLE\tNE_TYPE.
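
Because the highly-linked articles are identified only by their position at the end of the file, separating the two samples takes a little care. A minimal sketch, assuming one PAGE_TITLE\tNE_TYPE entry per non-blank line (the function name split_earlier_gold and the sample entries are hypothetical):

```python
def split_earlier_gold(lines, n_popular=200):
    """Return (random_sample, popular_sample) as lists of (title, type).

    The last n_popular entries are treated as the highly-linked sample,
    following the file layout described above.
    """
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        title, ne_type = line.split("\t", 1)
        entries.append((title, ne_type))
    cut = max(len(entries) - n_popular, 0)
    return entries[:cut], entries[cut:]


# Hypothetical entries; a small n_popular keeps the example short.
sample = ["A\tPER\n", "B\tLOC\n", "C\tORG\n", "D\tLOC\n", "E\tPER\n"]
random_part, popular_part = split_earlier_gold(sample, n_popular=2)
print(len(random_part), len(popular_part))
```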

WikiGold

Our gold NER annotations over a small sample of Wikipedia articles, in CoNLL format: WikiGold

See Balasuriya et al. (2009).
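
For readers unfamiliar with CoNLL-style files, the sketch below groups tokens into sentences. It assumes the common convention of one token per line with the NE tag as the last whitespace-separated field, blank lines between sentences, and optional -DOCSTART- lines between documents; check the WikiGold file itself, since its exact layout is not described here. The function name read_conll and the sample lines are hypothetical.

```python
def read_conll(lines):
    """Yield sentences as lists of (token, tag) pairs.

    Assumes one token per line with the tag in the last field, and a blank
    line (or a -DOCSTART- line) marking a sentence boundary.
    """
    sentence = []
    for line in lines:
        fields = line.split()
        if not fields or fields[0] == "-DOCSTART-":
            if sentence:
                yield sentence
                sentence = []
            continue
        sentence.append((fields[0], fields[-1]))
    if sentence:
        yield sentence  # final sentence without a trailing blank line


# Hypothetical lines for illustration only.
sample = ["Sydney I-LOC\n", "is O\n", "\n", "Joel I-PER\n", "Nothman I-PER\n"]
sentences = list(read_conll(sample))
print(len(sentences))
```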

Related publications

  • Dominic Balasuriya, Nicky Ringland, Joel Nothman, Tara Murphy, and James R. Curran (2009).
    Named entity recognition in Wikipedia. In Proceedings of the Workshop on the People’s Web Meets NLP: Collaboratively Constructed Semantic Resources (PeoplesWeb), pages 10–18.

    Named entity recognition (NER) is used in many domains beyond the newswire text that comprises current gold-standard corpora. Recent work has used Wikipedia’s link structure to automatically generate near gold-standard annotations. Until now, these resources have only been evaluated on newswire corpora or themselves. We present the first NER evaluation on a Wikipedia gold standard (WG) corpus. Our analysis of cross-corpus performance on WG shows that Wikipedia text is a harder NER domain than newswire. We find that an automatic annotation of Wikipedia has high agreement with WG and, when used as training data, outperforms newswire models by up to 7.7%.

  • Joel Nothman, Nicky Ringland, Will Radford, Tara Murphy, and James R. Curran (2012).
    Learning multilingual named entity recognition from Wikipedia. Artificial Intelligence (in press). Elsevier.

    We automatically create enormous, free and multilingual “silver”-standard training annotations for named entity recognition (NER) by exploiting the text and structure of Wikipedia. Most NER systems rely on statistical models of annotated data to identify and classify names of people, locations and organisations in text. This dependence on expensive annotation is the knowledge bottleneck our work overcomes. We first classify each Wikipedia article into named entity (NE) types, training and evaluating on 7,200 manually-labelled Wikipedia articles across nine languages. Our cross-lingual approach achieves up to 95% accuracy. We transform the links between articles into NE annotations by projecting the target article’s classifications onto the anchor text. This approach yields reasonable annotations, but does not immediately compete with existing gold-standard data. By inferring additional links and heuristically tweaking the Wikipedia corpora, we better align our automatic annotations to gold standards. We annotate millions of words in nine languages, evaluating English, German, Spanish, Dutch and Russian Wikipedia-trained models against CoNLL Shared Task data and other gold-standard corpora. Our approach outperforms other approaches to automatic NE annotation (Richman08, Mika08); competes with gold-standard training when tested on an evaluation corpus from a different source; and performs 10% better than newswire-trained models on manually-annotated Wikipedia text.

  • Joel Nothman, James R. Curran, and Tara Murphy (2008).
    Transforming Wikipedia into named entity training data. In Proceedings of the Australasian Language Technology Workshop.

    Statistical named entity recognisers require costly hand-labelled training data and, as a result, most existing corpora are small. We exploit Wikipedia to create a massive corpus of named entity annotated text. We transform Wikipedia’s links into named entity annotations by classifying the target articles into common entity types (e.g. person, organisation and location). Comparing to MUC, CoNLL and BBN corpora, Wikipedia generally performs better than other cross-corpus train/test pairs.

  • Joel Nothman (2008).
    Learning named entity recognition from Wikipedia.
    Honours thesis, University of Sydney.

    We present a method to produce free, enormous corpora to train taggers for Named Entity Recognition (NER), the task of identifying and classifying names in text, often solved by statistical learning systems. Our approach utilises the text of Wikipedia, a free online encyclopedia, transforming links between Wikipedia articles into entity annotations. Having derived a baseline corpus, we found that altering Wikipedia’s links and identifying classes of capitalised non-entity terms would enable the corpus to conform more closely to gold-standard annotations, increasing performance by up to 32% F-score. The evaluation of our method is novel since the training corpus is not usually a variable in NER experimentation. We therefore develop a number of methods for analysing and comparing training corpora. Gold-standard training corpora for NER perform poorly (F-score up to 32% lower) when evaluated on test data from a different gold-standard corpus. Our Wikipedia-derived data can outperform manually-annotated corpora on this cross-corpus evaluation task by up to 7% on held-out test data. These experimental results show that Wikipedia is viable as a source of automatically-annotated training corpora, which have wide domain coverage applicable to a broad range of NLP applications.

  • Joel Nothman, Tara Murphy, and James R. Curran (2009).
    Analysing Wikipedia and gold standard corpora for NER training. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics (EACL).

    Named entity recognition (NER) for English typically involves three gold standards: MUC, CoNLL, or BBN, all created by costly manual annotation. Recent work has used Wikipedia to automatically create a massive corpus of named entity annotated text. We present a comprehensive cross-corpus evaluation of NER. We identify the causes of poor cross-corpus performance and demonstrate ways of making them more compatible. Using our process, we develop a Wikipedia corpus which outperforms gold standard corpora on cross-corpus evaluation by up to 11%.

  • Nicky Ringland, Joel Nothman, Tara Murphy, and James R. Curran (2009).
    Classifying articles in English and German Wikipedia. In Proceedings of the Australasian Language Technology Association Workshop (ALTW), pages 20–28.

    Named Entity (NE) information is critical for Information Extraction (IE) tasks. However, the cost of manually annotating sufficient data for training purposes, especially for multiple languages, is prohibitive, meaning automated methods for developing resources are crucial. We investigate the automatic generation of NE annotated data in German from Wikipedia. By incorporating structural features of Wikipedia, we can develop a German corpus which accurately classifies Wikipedia articles into NE categories to within 1% F-score of the state-of-the-art process in English.

  • Sam Tardif, James R. Curran, and Tara Murphy (2009).
    Improved text categorisation for Wikipedia named entities. In Proceedings of the Australasian Language Technology Association Workshop (ALTW), pages 104–108.

    The accuracy of named entity recognition systems relies heavily upon the volume and quality of available training data. Improving the process of automatically producing such training data is an important task, as manual acquisition is both time consuming and expensive. We explore the use of a variety of machine learning algorithms for categorising Wikipedia articles, an initial step in producing the named entity training data. We were able to achieve a categorisation accuracy of 95% F-score over six coarse categories, an improvement of up to 5% F-score over previous methods.