Joel Nothman
- Position
- 4th year PhD student
- Affiliation
- University of Sydney School of IT, Capital Markets CRC
- Homepage
- http://www.joelnothman.com/research
- History
- now a PhD student in our lab
- Ben Hachey, Will Radford, Joel Nothman, Matthew Honnibal, and James R. Curran (2012).
Evaluating entity linking with Wikipedia. Artificial Intelligence (in press). Elsevier.
Named Entity Linking (NEL) grounds entity mentions to their corresponding node in a Knowledge Base (KB). Recently, a number of systems have been proposed for linking entity mentions in text to Wikipedia pages. Such systems typically search for candidate entities and then disambiguate them, returning either the best candidate or NIL. However, comparison has focused on disambiguation accuracy, making it difficult to determine how search impacts performance. Furthermore, important approaches from the literature have not been systematically compared on standard data sets. We reimplement three seminal NEL systems and present a detailed evaluation of search strategies. Our experiments find that coreference and acronym handling leads to substantial improvement, and search strategies account for much of the variation between systems. This is an interesting finding, because these aspects of the problem have often been neglected in the literature, which has focused largely on complex candidate ranking algorithms.
- Joel Nothman, Nicky Ringland, Will Radford, Tara Murphy, and James R. Curran (2012).
Learning multilingual named entity recognition from Wikipedia. Artificial Intelligence (in press). Elsevier.
We automatically create enormous, free and multilingual “silver”-standard training annotations for named entity recognition (NER) by exploiting the text and structure of Wikipedia. Most NER systems rely on statistical models of annotated data to identify and classify names of people, locations and organisations in text. This dependence on expensive annotation is the knowledge bottleneck our work overcomes.
We first classify each Wikipedia article into named entity (NE) types, training and evaluating on 7,200 manually-labelled Wikipedia articles across nine languages. Our cross-lingual approach achieves up to 95% accuracy.
We transform the links between articles into NE annotations by projecting the target article’s classifications onto the anchor text. This approach yields reasonable annotations, but does not immediately compete with existing gold-standard data. By inferring additional links and heuristically tweaking the Wikipedia corpora, we better align our automatic annotations to gold standards.
We annotate millions of words in nine languages, evaluating English, German, Spanish, Dutch and Russian Wikipedia-trained models against CoNLL Shared Task data and other gold-standard corpora. Our approach outperforms other approaches to automatic NE annotation (Richman08, Mika08); competes with gold-standard training when tested on an evaluation corpus from a different source; and performs 10% better than newswire-trained models on manually-annotated Wikipedia text.
- Joel Nothman, Matthew Honnibal, Ben Hachey, and James R. Curran (2012).
Event linking: grounding event reference in a news archive. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL) (to appear).
- Will Radford, Ben Hachey, Matthew Honnibal, Joel Nothman, and James R. Curran (2011).
Naive but effective NIL clustering baselines -- CMCRC at TAC 2011. In Proceedings of the Text Analysis Conference (TAC).
This paper describes the CMCRC systems entered in the TAC 2011 entity linking challenge. We used our best-performing system from TAC 2010 to link queries, then clustered NIL links. We focused on naive baselines that group by attributes of the top entity candidate. All three systems performed strongly at 75.4% B3 F1, above the 71.6% median score.
- Will Radford, Ben Hachey, Joel Nothman, Matthew Honnibal, and James R. Curran (2010).
Document-level entity linking: CMCRC at TAC 2010. In Proceedings of the Text Analysis Conference (TAC).
This paper describes the CMCRC systems entered in the TAC 2010 entity linking challenge. The best performing system we describe implements the document-level entity linking system from Cucerzan (2007), with several additions that exploit global information. Our implementation of Cucerzan’s method achieved a score of 74.9% in development experiments. Additional global information improves performance to 78.4%. On the TAC 2010 test data, our best system achieves a score of 84.4%, which is second in the overall rankings of submitted systems.
- Dominic Balasuriya, Nicky Ringland, Joel Nothman, Tara Murphy, and James R. Curran (2009).
Named entity recognition in Wikipedia. In Proceedings of the Workshop on the People’s Web Meets NLP: Collaboratively Constructed Semantic Resources (PeoplesWeb), pages 10–18.
Named entity recognition (NER) is used in many domains beyond the newswire text that comprises current gold-standard corpora. Recent work has used Wikipedia’s link structure to automatically generate near gold-standard annotations. Until now, these resources have only been evaluated on newswire corpora or themselves.
We present the first NER evaluation on a Wikipedia gold standard (WG) corpus. Our analysis of cross-corpus performance on WG shows that Wikipedia text is a harder NER domain than newswire. We find that an automatic annotation of Wikipedia has high agreement with WG and, when used as training data, outperforms newswire models by up to 7.7%.
- Matthew Honnibal, Joel Nothman, and James R. Curran (2009).
Evaluating a statistical CCG parser on Wikipedia. In Proceedings of the Workshop on the People’s Web Meets NLP: Collaboratively Constructed Semantic Resources (PeoplesWeb).
The vast majority of parser evaluation is conducted on the 1984 Wall Street Journal (WSJ). In-domain evaluation of this kind is important for system development, but gives little indication about how the parser will perform on many practical problems. Wikipedia is an interesting domain for parsing that has so far been underexplored. We present statistical parsing results that for the first time provide information about what sort of performance a user parsing Wikipedia text can expect. We find that the C&C parser’s standard model is 4.3% less accurate on Wikipedia text, but that a simple self-training exercise reduces the gap to 3.8%. The self-training also speeds up the parser on newswire text by 20%.
- Joel Nothman, Tara Murphy, and James R. Curran (2009).
Analysing Wikipedia and gold standard corpora for NER training. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics (EACL).
Named entity recognition (NER) for English typically involves three gold standards: MUC, CoNLL, or BBN, all created by costly manual annotation. Recent work has used Wikipedia to automatically create a massive corpus of named entity annotated text.
We present a comprehensive cross-corpus evaluation of NER. We identify the causes of poor cross-corpus performance and demonstrate ways of making them more compatible. Using our process, we develop a Wikipedia corpus which outperforms gold standard corpora on cross-corpus evaluation by up to 11%.
- Nicky Ringland, Joel Nothman, Tara Murphy, and James R. Curran (2009).
Classifying articles in English and German Wikipedia. In Proceedings of the Australasian Language Technology Association Workshop (ALTW), pages 20–28.
Named Entity (NE) information is critical for Information Extraction (IE) tasks. However, the cost of manually annotating sufficient data for training purposes, especially for multiple languages, is prohibitive, meaning automated methods for developing resources are crucial. We investigate the automatic generation of NE annotated data in German from Wikipedia. By incorporating structural features of Wikipedia, we can develop a German corpus which accurately classifies Wikipedia articles into NE categories to within 1% F-score of the state-of-the-art process in English.
- Joel Nothman (2008).
Learning named entity recognition from Wikipedia.
Honours thesis, University of Sydney.
We present a method to produce free, enormous corpora to train taggers for Named Entity Recognition (NER), the task of identifying and classifying names in text, often solved by statistical learning systems. Our approach utilises the text of Wikipedia, a free online encyclopedia, transforming links between Wikipedia articles into entity annotations. Having derived a baseline corpus, we found that altering Wikipedia’s links and identifying classes of capitalised non-entity terms would enable the corpus to conform more closely to gold-standard annotations, increasing performance by up to 32% F-score. The evaluation of our method is novel since the training corpus is not usually a variable in NER experimentation. We therefore develop a number of methods for analysing and comparing training corpora. Gold-standard training corpora for NER perform poorly (F-score up to 32% lower) when evaluated on test data from a different gold-standard corpus. Our Wikipedia-derived data can outperform manually-annotated corpora on this cross-corpus evaluation task by up to 7% on held-out test data. These experimental results show that Wikipedia is viable as a source of automatically-annotated training corpora, which have wide domain coverage applicable to a broad range of NLP applications.
- Joel Nothman, James R. Curran, and Tara Murphy (2008).
Transforming Wikipedia into named entity training data. In Proceedings of the Australasian Language Technology Workshop.
Statistical named entity recognisers require costly hand-labelled training data and, as a result, most existing corpora are small. We exploit Wikipedia to create a massive corpus of named entity annotated text. We transform Wikipedia’s links into named entity annotations by classifying the target articles into common entity types (e.g. person, organisation and location). Comparing to MUC, CoNLL and BBN corpora, Wikipedia generally performs better than other cross-corpus train/test pairs.
- Baden Hughes, James Haggerty, Joel Nothman, Saritha Manickam, and James R. Curran (2005).
A distributed architecture for interactive parse annotation. In Proceedings of the Australasian Language Technology Workshop (ALTW), pages 207–214.
In this paper we describe a modular system architecture for distributed parse annotation using interactive correction. This involves interactively adding constraints to an existing parse until the returned parse is correct. Using a mixed initiative approach, human annotators interact live with distributed CCG parser servers through an annotation GUI. The examples presented to each annotator are selected by an active learning framework to maximise the value of the annotated corpus for machine learners. We report on an initial implementation based on a distributed workflow architecture.
Research Interests
event reference
We refer to events all the time, and news is often triggered by certain types of events. However, events are ill-defined, complicated beasts, both in their ontology (they are diverse and have sub-events, causally-related events, etc.) and in their linguistic realisation. We need systems that can better understand and work with event reference.
contextual disambiguation
Given multiple plausible semantic or referential interpretations of an utterance, how can we determine which is intended? In particular, can we determine what information would be sufficient (and likely acquirable) to make the disambiguation much clearer (that is, to increase the probability margins), and how do we proceed in seeking that data?
Applies to: information extraction (named entity disambiguation, event linking), answer selection in question answering, parse reranking, dialogue (elicitation) and incremental understanding