Diewald (2008)

Diewald, Nils (2008): Approximativer Stringvergleich in selbstorganisierenden Trie-Strukturen: Masterarbeit, Universität Bielefeld.
BibTeX Thesis

Abstract

My master thesis discusses string queries to lexicons with skewed distribution, such as those that occur when parsing text, i.e., some strings are queried more frequently than others and the internal structure of the strings is also skewed, i.e., one character in the string is followed by some characters more frequently than others.

Lexical implementations in trie form according to Fredkin, de la Briandais, and Clampett are considered (Chapter 2).

In follow-up Chapter 3, self-organization mechanisms (Move-to-Front, Transpose, and Frequency Count for list tries, Simple Exchange, Move-To-Root, and Splaying for TSTs) are applied to the latter two structures to optimize them for skewed queries. Experiments are conducted to investigate their efficiency for real data.

In another chapter (4), various mechanisms for fuzzy comparison of strings are presented, with the Levenshtein-Damerau distance proving to be best suited for fuzzy queries to lexicons in trie structures.

In a final 5th chapter, fuzzy string comparison is implemented in the self-organizing trie structures and experiments are conducted to investigate whether, due to self-organization, the hit quality and speed of responses are positively affected.

It is shown that self-organization can have a positive effect on fuzzy string queries to lexicons (Chapter 6). A possible application for this would be orthography programs that take into account all previous queries to the lexicon, i.e., all queries of the text up to the searched word, for correction suggestions, and thus also give a higher score to rare words if they are the subject of the text. This is contrasted with the computational effort that must be factored into self-organization.

In another chapter I had treated approximate string comparison in finite state transducers. This chapter was dropped in the end, parts of it (though not the approximate string comparison) can be found in a term paper on Finite State Morphology.

A program for finding similar and dissimilar words in a corpus according to the Levenshtein-Damerau distance can also be found on these pages.

The Perl programs created in the course of this work are to be regarded as prototypes. They are by no means optimized. The goal was to make the program structure as maintainable and understandable as possible. I have not thought about a license yet ...