Hierarchical word clustering - automatic thesaurus generation

Research output: Contribution to journal › Article › peer-review

Abstract

In this paper, we propose a hierarchical lexical clustering neural network algorithm that automatically generates a thesaurus (synonym abstraction) using purely stochastic information derived from unstructured text corpora, with no prior word classifications required. The lexical hierarchy overcomes the Vocabulary Problem by using synonym clusters to accommodate paraphrasing, and overcomes Information Overload by focusing search within cohesive clusters. We describe existing word categorisation methodologies, identifying their respective strengths and weaknesses, and evaluate our proposed approach against an existing neural approach, using a benchmark statistical approach and a human-generated thesaurus for comparison. We also evaluate our word context vector generation methodology against two similar approaches to investigate how word vector dimensionality and the number of words in the context window affect the quality of the word clusters produced. We demonstrate the effectiveness of our approach and its superiority to existing techniques. © 2002 Elsevier Science B.V. All rights reserved.
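To make the notions of a word context vector and a context window concrete, the following is a minimal illustrative sketch only. It is not the neural algorithm proposed in the paper: it substitutes plain co-occurrence counts for the word vectors and ordinary average-linkage agglomerative clustering for the hierarchical neural network. The toy corpus and all names (`context_vectors`, `cosine`, `agglomerate`, `window`) are hypothetical.

```python
from collections import Counter, defaultdict
import math


def context_vectors(tokens, window=2):
    """Map each word to counts of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for i, word in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                vectors[word][tokens[j]] += 1
    return dict(vectors)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def agglomerate(vectors):
    """Average-linkage agglomerative clustering (a stand-in, not the paper's method).

    Returns the merge history as (left_cluster, right_cluster, similarity) tuples.
    """
    clusters = [(w,) for w in sorted(vectors)]
    merges = []

    def link(c1, c2):
        total = sum(cosine(vectors[a], vectors[b]) for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Pick the most similar pair of clusters and merge it.
        sim, i, j = max(
            (link(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        merges.append((clusters[i], clusters[j], sim))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Hypothetical toy corpus; a real run would use a large unstructured text corpus.
tokens = "the cat sat on the mat the dog sat on the rug".split()
vectors = context_vectors(tokens, window=2)
merges = agglomerate(vectors)
for left, right, sim in merges:
    print(left, right, round(sim, 3))
```

Varying `window` changes how many neighbouring words contribute to each vector, which is the kind of parameter whose effect on cluster quality the abstract says the paper investigates.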

Original language: English
Pages (from-to): 819-846
Number of pages: 28
Journal: Neurocomputing
Volume: 48
Issue number: 1-4
Publication status: Published - Oct 2002

Bibliographical note

Copyright © 2002 Elsevier Science B.V. This is an author-produced version of a paper published in Neurocomputing. This paper has been peer-reviewed but does not include the final publisher proof-corrections or journal pagination.

Keywords

  • neural network
  • hierarchical thesaurus
  • lexical
  • synonym clustering
