Unsupervised learning of word segmentation rules with genetic algorithms and inductive logic programming

Research output: Contribution to journal > Article > peer-review

Abstract

This article presents a combination of unsupervised and supervised learning techniques for the generation of word segmentation rules from a raw list of words. First, a language bias for word segmentation is introduced, and a simple genetic algorithm is used to search for the segmentation that corresponds to the best bias value. In the second phase, the words segmented by the genetic algorithm are used as input for the first-order decision list learner CLOG. The result is a set of first-order rules which can be used for segmentation of unseen words. When applied to either the training data or unseen data, these rules produce segmentations which are linguistically meaningful and which conform to a large degree to the annotation provided.
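
The first phase described in the abstract can be illustrated with a minimal sketch. The abstract does not define the bias or the genetic operators, so everything below is an assumption made for illustration: each word is cut once into a left and a right segment, the bias is taken to be the total number of characters in the sets of distinct left and right segments (lower is better), and the names `bias` and `evolve` are hypothetical. The second phase, rule learning with CLOG, is not shown.

```python
import random


def bias(words, cuts):
    """Illustrative bias (an assumption, not taken from the abstract):
    total characters in the sets of distinct left and right segments."""
    left = {w[:c] for w, c in zip(words, cuts)}
    right = {w[c:] for w, c in zip(words, cuts)}
    return sum(map(len, left)) + sum(map(len, right))


def evolve(words, pop_size=40, generations=200, mutation_rate=0.1, seed=0):
    """Simple genetic algorithm: one chromosome holds one cut position per word."""
    rng = random.Random(seed)

    def random_cuts():
        return [rng.randint(0, len(w)) for w in words]

    population = [random_cuts() for _ in range(pop_size)]
    for _ in range(generations):
        # Truncation selection: keep the better half unchanged (elitism).
        population.sort(key=lambda cuts: bias(words, cuts))
        parents = population[: pop_size // 2]
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            point = rng.randint(0, len(words))  # one-point crossover
            child = a[:point] + b[point:]
            for i, w in enumerate(words):       # per-gene mutation
                if rng.random() < mutation_rate:
                    child[i] = rng.randint(0, len(w))
            children.append(child)
        population = parents + children

    best = min(population, key=lambda cuts: bias(words, cuts))
    return [(w[:c], w[c:]) for w, c in zip(words, best)]


if __name__ == "__main__":
    for left, right in evolve(["walked", "talked", "walking", "talking"]):
        print(left, "+", right)
```

Under this assumed bias, a segmentation that reuses segments across words scores better than one that does not: cutting "walked" and "talked" after the fourth letter gives a bias of 10, while leaving both words whole gives 12. In the pipeline the abstract describes, the segmented pairs would then serve as training input for the rule learner.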

Original language: English
Pages (from-to): 121-162
Number of pages: 42
Journal: Machine Learning
Volume: 43
Issue number: 1-2
Publication status: Published - Apr 2001

Keywords

  • unsupervised machine learning
  • inductive logic programming
  • natural language
  • word segmentation
