Segmentation and Normalisation in Grapheme Codebooks

Research output: Chapter in Book/Report/Conference proceeding (Conference contribution)

Abstract

The grapheme codebook is a high-performing technique for offline writer identification. This paper considers whether the de facto standards for initial grapheme extraction are optimal for both modern and historical datasets. We examine the construction and representation of the graphemes that comprise the codebook, testing three segmentation methods and two grapheme size normalisation methods on two datasets: a 93-writer IAM dataset, and a 43-writer medieval English dataset. The standard minima-split segmentation is compared to a complementary segmentation method that preserves ligature shapes, as well as the union of both these methods. Classification performance for each method is compared on a range of codebook sizes. We demonstrate that grapheme aspect-ratio is not always a writer-specific feature, and that preserving the character body shape in segmentation is more informative than preserving cursive text ligatures.
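The minima-split segmentation mentioned above cuts a cursive word image at the local minima of its lower contour, which typically fall on the ligatures between character bodies. The following is a minimal sketch of that idea, assuming a binary image given as a list of rows with 1 for ink; the function names and the simple column-wise contour are illustrative assumptions, not the paper's implementation.

```python
def lower_contour(image):
    """Row index of the lowest ink pixel in each column (None if the
    column is blank). `image` is a list of rows; 1 = ink, 0 = background."""
    height, width = len(image), len(image[0])
    contour = []
    for x in range(width):
        lowest = None
        for y in range(height - 1, -1, -1):  # scan upward from the bottom
            if image[y][x]:
                lowest = y
                break
        contour.append(lowest)
    return contour

def minima_split_points(contour):
    """Columns where the lower contour has a local minimum, i.e. where
    the ink reaches least far down -- candidate ligature cut points."""
    cuts = []
    for x in range(1, len(contour) - 1):
        prev, cur, nxt = contour[x - 1], contour[x], contour[x + 1]
        if None in (prev, cur, nxt):
            continue
        # Smaller row index = higher on the page = contour minimum.
        if cur < prev and cur <= nxt:
            cuts.append(x)
    return cuts

def segment(image):
    """Split the word image into grapheme sub-images at the minima columns."""
    cuts = minima_split_points(lower_contour(image))
    edges = [0] + cuts + [len(image[0])]
    return [[row[a:b] for row in image] for a, b in zip(edges, edges[1:])]
```

A real pipeline would segment connected components of the ink rather than raw pixel columns, but the cut criterion is the same: the stroke is divided wherever its lower profile rises to a local peak on the page.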
Original language: English
Title of host publication: 2011 International Conference on Document Analysis and Recognition, ICDAR 2011, Beijing, China, September 18-21, 2011
Publisher: IEEE
Pages: 613-617
Number of pages: 5
ISBN (Electronic): 9780769545202
ISBN (Print): 9781457713507
DOIs
Publication status: Published - 2011