Conventional document search techniques are constrained by their attempt to match individual keywords or phrases against source documents. As a result, they miss documents that contain semantically similar terms and thus achieve relatively low recall. At the same time, processing capabilities and tools for syntactic and semantic analysis of language have advanced to the point where index-time linguistic analysis of source documents is both feasible and realistic. In this paper, we introduce document dimensions, a means of classifying or grouping terms discovered in documents. Using an enhanced version of Jakarta Lucene, we demonstrate that supplementing keyword analysis with some syntactic and semantic information can indeed enhance the quality of information retrieval results.
Published - 2004