Classical approaches to error-tolerant search
This section briefly describes the most relevant "classical" approaches to the
problem of error-tolerant search.
Deterministic Finite Automaton (DFA or Transducer)
DFAs are based on regular languages and are widely used in compilers and interpreters to
quickly check words for membership in a defined word set. Retrieval in this kind of
structure is a directed walk through a set of states, which is usually extremely
fast and independent of the size of the dictionary. Error tolerance can only be added
by enumerating expected variations of the query string or by adding such variations to the
dictionary. This amounts to a trial-and-error approach, which is naturally very
limited.
In the former case retrieval time grows exponentially with the number of allowed
errors; in the latter, memory consumption becomes unmanageable for large
dictionaries. In practice this approach is usable for at most two corrections, since its
major drawback is that resource consumption, whether in time or memory, increases
exponentially with the number of corrections.
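To make the variant-enumeration problem concrete, the following minimal Python sketch (all names are illustrative assumptions, not taken from any particular product) walks a trie-shaped automaton for exact membership and then brute-forces error tolerance by enumerating all one-edit variants of the query. Even a single allowed error already yields hundreds of candidate strings, and each further allowed error multiplies that number again.

    # Minimal sketch: trie-based dictionary automaton plus naive
    # error tolerance via enumeration of one-edit query variants.
    ALPHABET = "abcdefghijklmnopqrstuvwxyz"

    def build_trie(words):
        root = {}
        for w in words:
            node = root
            for ch in w:
                node = node.setdefault(ch, {})
            node["$"] = True              # end-of-word marker
        return root

    def accepts(trie, word):
        node = trie
        for ch in word:                   # directed walk through the states
            if ch not in node:
                return False
            node = node[ch]
        return "$" in node

    def one_edit_variants(word):
        # Insertions, deletions, substitutions: O(len(word) * |alphabet|)
        # strings for one error; k errors multiply this blow-up k times.
        out = set()
        for i in range(len(word) + 1):
            for c in ALPHABET:
                out.add(word[:i] + c + word[i:])          # insertion
            if i < len(word):
                out.add(word[:i] + word[i + 1:])          # deletion
                for c in ALPHABET:
                    out.add(word[:i] + c + word[i + 1:])  # substitution
        return out

    trie = build_trie(["search", "sea", "serve"])
    print(accepts(trie, "search"))                        # True
    print(len(one_edit_variants("serch")))                # several hundred variants
    print(any(accepts(trie, v) for v in one_edit_variants("serch")))  # True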
Hash Tables
There is a wide range of hash table implementations. All of them share the
advantage that lookup is extremely fast (constant time), but they are not fault tolerant at all.
Error tolerance is usually added by doubling the hash table and accessing it with
the so-called "front hash"/"back hash" procedure, which assumes that an error occurs in
only one half of the word. Memory requirements are comparatively small. Some
implementations provide a built-in hash for special situations where time can be saved
when a perfect match is sufficient. There is no commercially available solution that
relies on this technology.
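As a rough Python sketch of the front/back hashing idea (the table layout and names are assumptions for illustration): each entry is indexed twice, once by its first half and once by its second half, so that a query whose error is confined to one half can still be found through the other, error-free half.

    from collections import defaultdict

    def halves(word):
        mid = len(word) // 2
        return word[:mid], word[mid:]

    def build_tables(words):
        front, back = defaultdict(set), defaultdict(set)
        for w in words:
            f, b = halves(w)
            front[f].add(w)               # indexed by the front half
            back[b].add(w)                # indexed by the back half
        return front, back

    def lookup(front, back, query):
        f, b = halves(query)
        # Candidates whose front half OR back half matches exactly; a
        # query with errors in both halves is missed, and length-changing
        # errors shift the split point and would need extra probes.
        return front.get(f, set()) | back.get(b, set())

    front, back = build_tables(["example", "sample", "simple"])
    print(lookup(front, back, "exampel"))  # error in back half -> {'example'}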
Associative Memory (Neural Nets)
This approach relies heavily on a learning scheme. Memory consumption is significant
because a large set of learned coupling constants has to be stored. Retrieval time
depends on the dictionary size, but not on the number of errors. The biggest
disadvantage of neural nets is the lack of result transparency: the results and their
credibility depend heavily on random factors during the learning period. The
overall performance is also rather slow. Other associative memory approaches try to define
an objective set of "features" (such as bigrams) in order to increase transparency.
However, these features are often not very distinctive and produce a lot of "cross-talk",
that is, the retrieval of unwanted matches.
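A small Python sketch of the bigram-feature idea shows the cross-talk effect directly (the Dice coefficient used here is one common similarity choice, an assumption rather than a reference to any specific system): a scrambled query can score higher against an unrelated word than against its intended target.

    def bigrams(word):
        # Reduce a word to its set of character bigrams.
        return {word[i:i + 2] for i in range(len(word) - 1)}

    def dice(a, b):
        # Dice coefficient on bigram sets, in [0, 1].
        x, y = bigrams(a), bigrams(b)
        return 2 * len(x & y) / (len(x) + len(y))

    # "amned" is a simple transposition of "amend", yet it shares more
    # bigrams with the unrelated word "named": classic cross-talk.
    print(dice("amned", "amend"))   # 0.25
    print(dice("amned", "named"))   # 0.5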
Bipartite Matching
A totally different approach is bipartite matching. Here a cost function is introduced for
displacing a character, and the idea is to match each query string to the reference string
with the lowest possible displacement cost. Unfortunately this technology can only be
used by "compiling" the query and running it against the data, with an overall
performance comparable to grep and other scanners. The results are intuitively plausible,
but there are no clear criteria for their quality. Developers of solutions based
on this technology admit that calculating the edit distance
would be clearly superior, but that they have no efficient way to implement it.
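The displacement-cost idea can be illustrated with a deliberately tiny Python toy (brute force over permutations, restricted to pure character rearrangements; a real system would use a proper assignment algorithm and would also have to price insertions and deletions):

    from itertools import permutations

    def displacement_cost(query, ref):
        # Toy version: only strings that are rearrangements of each other.
        if len(query) != len(ref) or sorted(query) != sorted(ref):
            return None
        best = None
        for perm in permutations(range(len(ref))):
            # An assignment is valid only where the characters agree.
            if all(query[i] == ref[j] for i, j in enumerate(perm)):
                cost = sum(abs(i - j) for i, j in enumerate(perm))
                best = cost if best is None else min(best, cost)
        return best

    print(displacement_cost("form", "from"))  # 2: 'o' and 'r' each move one place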
Longest-Common-Subsequence (LCS)
This is considered a simplified variation of the Levenshtein algorithm. It has
drawbacks, since directly adjacent sequences of characters receive the same
credibility as scattered sequences. In cases where a clear relationship between
query length and dictionary entry length cannot be given, it is a convenient algorithm
for approximate retrieval, but it also has no efficient implementation.
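The standard dynamic-programming formulation makes the adjacency problem visible; in the Python sketch below, a contiguous match and a fully scattered match receive exactly the same score:

    def lcs_length(a, b):
        # Classic O(len(a) * len(b)) dynamic program for the LCS length.
        m, n = len(a), len(b)
        dp = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                if a[i - 1] == b[j - 1]:
                    dp[i][j] = dp[i - 1][j - 1] + 1
                else:
                    dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
        return dp[m][n]

    # Both entries score 3 against the query "abc", although only the
    # first contains it as one adjacent block.
    print(lcs_length("abc", "abcxx"))  # 3 (contiguous)
    print(lcs_length("abc", "axbxc"))  # 3 (scattered, same credit)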
Glob Style, Regular Expressions
The most common method of imprecise querying is regular expression or glob-style
matching. Regular expressions (egrep) are cumbersome to use, and if they are not
specified with care, the correct entry can still be missed. As a result, neither
reliability nor transparency is provided. Search performance on full-table scans of
large databases is so poor that the technology is not commercially viable.
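A short Python example (the name list and the pattern are invented for illustration) shows how easily a hand-written error-tolerant pattern misses a correct entry: the expression below covers several substitution variants of a name but silently fails on an omission.

    import re

    entries = ["Meyer", "Mayer", "Maier", "Myer"]
    pattern = re.compile(r"^M[ae][iy]er$")  # tolerates substitutions only

    for e in entries:
        print(e, bool(pattern.match(e)))
    # Meyer True, Mayer True, Maier True, Myer False  <- the entry is missed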