Hristo

Reputation: 790

Overcoming the Bitap algorithm's search pattern length

I am new to the field of approximate string matching.

I am exploring uses for the Bitap algorithm, but its limited pattern length troubles me. I am working with Flash, where I have 32-bit unsigned integers and an IEEE-754 double-precision floating-point Number type, which can represent integers of up to 53 bits exactly. Still, I would rather have a fuzzy matching algorithm that can handle patterns longer than about 50 characters.

The Wikipedia page of the Bitap algorithm mentions libbitap, which supposedly demonstrates an unlimited pattern length implementation of the algorithm, but I have trouble getting the idea from its sources.

Have you got any suggestions about how to generalise Bitap for patterns of unlimited length, or about another algorithm that can perform fuzzy string matching of a needle near a suggested location in the haystack?

Upvotes: 3

Views: 3581

Answers (2)

angstyloop

Reputation: 327

The simplest form of fuzzy match is probably "match with mismatches". Mismatches are sometimes called substitutions. The point is we are not considering deletions or insertions.
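To make the definition concrete, here is a minimal brute-force sketch in plain C. It is not the algorithm discussed below, just the definition written out: slide the pattern over the text and count the positions that differ. The function names `mismatches_at` and `first_match_with_mismatches` are made up for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count the positions at which @pattern differs from
 * the window of @text that starts at @start. Both strings are
 * NUL-terminated and the window must fit inside the text. */
size_t mismatches_at(const char *text, const char *pattern, size_t start)
{
    size_t n = strlen(text), m = strlen(pattern), errors = 0;
    assert(m <= n && start <= n - m);
    for (size_t j = 0; j < m; j++)
        if (text[start + j] != pattern[j])
            errors++;
    return errors;
}

/* Hypothetical helper: start of the first window with at most @k
 * mismatches, or -1 if no window qualifies. Runs in O(n * m) time. */
long first_match_with_mismatches(const char *text, const char *pattern, size_t k)
{
    size_t n = strlen(text), m = strlen(pattern);
    if (m > n)
        return -1;
    for (size_t i = 0; i + m <= n; i++)
        if (mismatches_at(text, pattern, i) <= k)
            return (long) i;
    return -1;
}
```

With the text "xxxadcxxx" and the pattern "abc", the window at position 3 ("adc") has one mismatch, so it is found with k = 1 but not with k = 0.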

Ricardo Baeza-Yates, the author of many versions of Bitap, also authored an algorithm for "match with mismatches" with Chris Perleberg. The algorithm uses linked lists instead of bit arrays, but the spirit of the algorithm is the same. The paper is cited in the comments of the code below.
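As a rough sketch of the counting idea, assuming plain C with no GLib: each text character that also occurs in the pattern counts as a hit for the window it would line up with, and a circular buffer holds one counter per window still in progress. This is not the paper's code. The name `first_match_counting` is hypothetical, and flat arrays stand in for the linked lists.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch: start of the first window of @text that differs from
 * @pattern in at most @k positions, or -1 if there is none (or if memory
 * could not be allocated). */
long first_match_counting(const char *text, const char *pattern, int k)
{
    size_t n = strlen(text), m = strlen(pattern);
    long found = -1;
    assert(k >= 0);
    if (m == 0 || m > n)
        return -1;

    /* head[c] is the last pattern index holding byte c and next[j] links to
     * an earlier index holding the same byte; -1 ends a chain. */
    long head[256];
    long *next = malloc(m * sizeof *next);
    /* count[s] starts at m and loses one per lined-up character, so it ends
     * as the number of mismatches of the window that finishes in slot s. */
    size_t *count = malloc(m * sizeof *count);
    if (!next || !count) {
        free(next);
        free(count);
        return -1;
    }

    for (int c = 0; c < 256; c++)
        head[c] = -1;
    for (size_t j = 0; j < m; j++) {
        unsigned char c = (unsigned char) pattern[j];
        next[j] = head[c];
        head[c] = (long) j;
        count[j] = m;
    }

    for (size_t i = 0; i < n && found < 0; i++) {
        /* A text byte equal to pattern[j] is a hit for the window that ends
         * m - 1 - j characters later. */
        for (long j = head[(unsigned char) text[i]]; j >= 0; j = next[j])
            count[(i + m - 1 - (size_t) j) % m]--;
        if (i >= m - 1 && count[i % m] <= (size_t) k)
            found = (long) (i - m + 1);
        count[i % m] = m; /* recycle the slot for a later window */
    }

    free(next);
    free(count);
    return found;
}
```

The buffer needs only m slots because at any moment at most m windows are still collecting hits. The pattern length is limited by available memory rather than by the machine word size.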

Here is a C implementation of the Baeza-Yates-Perleberg "match with mismatches" algorithm that uses GLib. It is less optimized than the original implementation, but it places no fixed limit on the size of the text, and the pattern can be as long as the count buffer you allocate (max_pattern_size in the code).

https://gist.github.com/angstyloop/e4ca495542cd469790ca926ade2fc072

/* g_match_with_mismatches_v3.c
 *
 * Search for fuzzy matches of a pattern in text with @k or fewer mismatches
 * (substitutions). Uses doubly-linked list GList from GLib.
 *
 * COMPILE
 *
 * gcc `pkg-config --cflags glib-2.0` -o g_match_with_mismatches_v3 g_match_with_mismatches_v3.c `pkg-config --libs glib-2.0`
 *
 * RUN
 *
 * ./g_match_with_mismatches_v3
 *
 * REFS
 *
 * The search and preprocess functions were taken from the example code from
 * "Fast and Practical Approximate String Matching" by Ricardo A. Baeza-Yates
 * and Chris H. Perleberg. I just modified the code.
 */

#include <glib.h>
#include <stdio.h>
#include <string.h>

/* Size of the alpha index array. Plain ASCII has 128 characters; using 256
 * entries means every byte value, read as an unsigned char, is a valid index.
 */
#define ALPHABET_SIZE 256

/* The Match object will hold the index @i of the first character of the match
 * and the number of mismatched characters @k within that match. The Match
 * objects are also linked list nodes.
 */
typedef struct Match Match;
struct Match {
    int i;
    int k;
};

Match *
Match_new( int i, int k )
{
    Match *t;
    t = g_malloc( sizeof( Match ) );
    t->i = i;
    t->k = k;
    return t;
}

void
Match_free( gpointer match_, gpointer user_data )
{
    Match *match = (Match *) match_;
    if ( match )
        g_free( match );
}

void
Match_print( gpointer match_, gpointer user_data )
{
    Match *match = (Match *) match_;
    printf( "position: %d, errors: %d\n", match->i, match->k );
}

/* An array of lists, one for each byte value (ALPHABET_SIZE in total). Only
 * the first 128 correspond to ASCII characters.
 *
 * Each list will contain the offsets for each occurrence of that character in
 * the pattern, or a single placeholder offset of -1 if no occurrences are
 * found.
 *
 * The offset is the distance of the character to the left of the end of
 * pattern, (i.e., measured by counting to the left from the end of pattern),
 * for a given character and a given instance of pattern.
 */
GList *alpha[ALPHABET_SIZE];

/* This function initializes the @alpha and @count arrays to look like this:
 *
 *     alpha = [ [ -1 ] ] * ALPHABET_SIZE   where the inner brackets are a GList.
 *
 *     count = [m] * max_pattern_size
 *
 * @alpha will be an array of linked lists. Characters that do not occur in
 * the pattern, or that occur in it exactly once, will have corresponding
 * linked lists with length one. Characters that occur in the pattern more
 * than once will have corresponding linked lists with length greater than
 * one.
 *
 * The first m - 1 elements of @count will be skipped on the first iteration of
 * the cyclic array (since no match can be shorter than the @pattern). Note that
 * the values in @count are reset to m once they are no longer needed, until the
 * next loop around @count.
 *
 * @p - pattern string
 * @m - pattern length
 * @alpha - array of GList. See above.
 * @count - circular buffer for counts of matches
 */
void preprocess( char *p, int m, GList *alpha[], int count[], int max_pattern_size )
{
    int i;

    for ( i = 0; i < ALPHABET_SIZE; i++ ) {
        alpha[i] = NULL;
        alpha[i] = g_list_append( alpha[i], GINT_TO_POINTER( -1 ) );
    }

    /* Record each pattern character's offset from the end of the pattern.
     * Index through unsigned char so bytes above 127 cannot go negative.
     */
    for ( i = 0; i < m; i++, p++ ) {
        unsigned char c = (unsigned char) *p;
        if ( GPOINTER_TO_INT( alpha[c]->data ) == -1 )
            alpha[c]->data = GINT_TO_POINTER( m - i - 1 );
        else
            alpha[c] = g_list_append( alpha[c],
                GINT_TO_POINTER( m - i - 1 ) );
    }

    for ( i = 0; i < max_pattern_size; i++ )
        count[i] = m;

}

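/* GFunc callback for g_list_foreach. Despite its name, it decrements the
 * count of the pattern instance that ends @off characters after text index i.
 */
void
increment_offset( gpointer off_, gpointer args_ )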
{
    gpointer *args = (gpointer *) args_;
    int i = GPOINTER_TO_INT( args[0] );
    int max_pattern_size = GPOINTER_TO_INT( args[1] );
    int *count = (int *) args[2];
    gint off = GPOINTER_TO_INT( off_ ) ;
    count[(i + off) % max_pattern_size]--;
}

/* Find the position of the first character and number of mismatches of every
 * fuzzy match in a string @t with @k or fewer mismatches. Uses the array of
 * GList @alpha and the array of counts @count prepared by the preprocess
 * function.
 * @t - text string
 * @n - length of text string
 * @m - length of the pattern used to create @alpha and @count
 * @k - maximum number of allowed mismatches
 * @alpha - array of GList. See above.
 * @count - circular buffer for counts of matches
 */
GList *
search( char *t, int n, int m, int k, GList *alpha[], int count[], int max_pattern_size )
{
    int i, off;
    Match *match;
    GList *l0 = NULL, *l1 = NULL;

    /* Walk the text @t, which has length @n.
     */
    for ( i = 0; i < n; i++ ) {
        /* If the current character in @t is in pattern, its
         * corresponding list in @alpha will have a non-negative offset,
         * thanks to the work done by the preprocess function. If so, we
         * need to decrement the counts in the circular buffer @count
         * corresponding to the index of the character in the text and
         * the offsets in the list for that character, which the
         * preprocess function prepared.
         * 
         * Note that we will only ever need m counts at a time, and
         * we reset them to @m when we are done with them, in case
         * they are needed when the text wraps max_pattern_size
         * characters.
         */
        l0 = alpha[(unsigned char) *t++];
        g_assert( l0 );
        off = GPOINTER_TO_INT( l0->data );
        if ( off >= 0 ) {
            gpointer args[3] = {
                GINT_TO_POINTER( i ),
                GINT_TO_POINTER( max_pattern_size ),
                (gpointer) count,
            };
            g_list_foreach( l0, increment_offset, args );
        }

        /* If the count in @count corresponding to the current index in
         * the text is no greater than @k, the number of mismatches we
         * allow, then the pattern instance is reported as a fuzzy
         * match. The position of the first letter in the match is
         * calculated using the pattern length and the index of the last
         * character in the match. The number of mismatches is calculated
         * from the number of matches. The first m - 1 elements are
         * skipped.
         */
        if ( i >= m - 1 && count[i % max_pattern_size] <= k ) {
            g_assert( i - m + 1 >= 0 );
            match = Match_new( i - m + 1, count[i % max_pattern_size] );
            l1 = g_list_append( l1, match );
        }

        /* The count in @count corresponding to the current index in
         * text is no longer needed, so we reset it to @m until we
         * need it on the next wraparound.
         */
        count[i % max_pattern_size] = m;
    }

    return l1;
}

/* This is a test harness for the code in this file.
 */
int main()
{

    /* Define the max pattern size. It must be at least the pattern length. It
     * can be larger (e.g. 65535) if you want, as long as @count fits in memory.
     */
    const int max_pattern_size = 256;

    /* This array is used as a cyclic buffer for counts of the number of matching
     * characters in a given instance of the pattern. The counts are maintained at
     * the end of each pattern. When a pattern with k or fewer mismatches is found,
     * it is reported. As the algorithm steps through the count array, it resets the
     * counts it doesn't need anymore back to m, so they can be reused when the
     * index in the text reaches the end and needs to wrap around. The first m - 1
     * counts never belong to a complete instance of the pattern, so the search
     * skips them.
     */
    int count[max_pattern_size];

    char *text = "xxxadcxxx", *pattern = "abc";
    int n  = strlen( text ), m = strlen( pattern ), k;
    Match *match = NULL;
    GList *list;

    /* Test Match class
     */
    printf( "\nTesting Match class..\n\n" );
    match = Match_new( 0, 0 );
    g_assert( match );
    Match_print( match, NULL );
    Match_free( match, NULL );
    printf( "\nDone testing Match class.\n\n" );

    /* Test preprocess and search functions.
     */
    printf( "\nTesting \"preprocess\" and \"search\" functions...\n" );

    k = 0;
    printf( "\n...with number of allowed errors k = %d\n", k );
    preprocess( pattern, m, alpha, count, max_pattern_size );
    list = search( text, n, m, k, alpha, count, max_pattern_size );
    g_list_foreach( list, Match_print, NULL );
    g_list_foreach( list, Match_free, NULL  );
    if ( !g_list_length( list ) )
        printf( "No matches.\n" );

    k = 1;
    printf( "\n...with number of allowed errors k = %d\n", k );
    preprocess( pattern, m, alpha, count, max_pattern_size );
    list = search( text, n, m, k, alpha, count, max_pattern_size );
    g_list_foreach( list, Match_print, NULL );
    match = (Match *) g_list_nth_data( list , 0 );
    g_assert( match->i == 3 );
    g_assert( match->k == 1 );
    g_list_foreach( list, Match_free, NULL  );

    k = 2;
    printf( "\n...with number of allowed errors k = %d\n", k );
    preprocess( pattern, m, alpha, count, max_pattern_size );
    list = search( text, n, m, k, alpha, count, max_pattern_size );
    g_list_foreach( list, Match_print, NULL );
    match = (Match *) g_list_nth_data( list , 0 );
    g_assert( match->i == 3 );
    g_assert( match->k == 1 );
    g_list_foreach( list, Match_free, NULL  );

    k = 3;
    printf( "\n...with number of allowed errors k = %d\n", k );
    preprocess( pattern, m, alpha, count, max_pattern_size );
    list = search( text, n, m, k, alpha, count, max_pattern_size );
    g_list_foreach( list, Match_print, NULL );
    match = (Match *) g_list_nth_data( list , 3 );
    g_assert( match->i == 3 );
    g_assert( match->k == 1 );
    g_list_foreach( list, Match_free, NULL  );

    printf( "\nDone testing \"preprocess\" and \"search\" functions.\n\n" );

    return 0;
} 

Output

Here is the output of the simple compiled example program:

Testing Match class..

position: 0, errors: 0

Done testing Match class.


Testing "preprocess" and "search" functions...

...with number of allowed errors k = 0
No matches.

...with number of allowed errors k = 1
position: 3, errors: 1

...with number of allowed errors k = 2
position: 3, errors: 1

...with number of allowed errors k = 3
position: 0, errors: 3
position: 1, errors: 3
position: 2, errors: 3
position: 3, errors: 1
position: 4, errors: 3
position: 5, errors: 3
position: 6, errors: 3

Done testing "preprocess" and "search" functions.

Here is a small GTK4 application that uses this code:

https://gist.github.com/angstyloop/2281191a3e7fd7e4c615698661fbac24


By dynamically picking the max length of the pattern, you can get a full fuzzy match for free if the strings you are searching are mostly far apart in terms of Hamming distance. Even with insertions and deletions, the string that is closest in terms of Hamming distance will have a small number of mismatches compared to the other strings. The user will have to make many errors, or two of the strings will have to be very close, in order to break that nice behavior.
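One way to read this, sketched in plain C under my own assumptions: score every candidate string by the fewest mismatches the query achieves at any alignment inside it, then pick the candidate with the lowest score. The names `best_mismatches` and `closest_candidate` are hypothetical, and the scoring is brute force rather than the list-based search above.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: fewest mismatches over every alignment of @query
 * inside @text. A query that does not fit scores strlen(query), the worst
 * possible value. */
size_t best_mismatches(const char *text, const char *query)
{
    size_t n = strlen(text), m = strlen(query), best = m;
    if (m > n)
        return m;
    for (size_t i = 0; i + m <= n; i++) {
        size_t errors = 0;
        for (size_t j = 0; j < m; j++)
            if (text[i + j] != query[j])
                errors++;
        if (errors < best)
            best = errors;
    }
    return best;
}

/* Hypothetical helper: index of the candidate with the lowest score.
 * Ties go to the earliest candidate. */
size_t closest_candidate(const char *const *candidates, size_t count,
                         const char *query)
{
    assert(count > 0);
    size_t winner = 0;
    size_t winner_score = best_mismatches(candidates[0], query);
    for (size_t c = 1; c < count; c++) {
        size_t score = best_mismatches(candidates[c], query);
        if (score < winner_score) {
            winner = c;
            winner_score = score;
        }
    }
    return winner;
}
```

For example, the misspelled query "bananna" contains an insertion, yet it still scores 2 against "banana bread" and 6 or more against "apple pie" and "cherry tart", so the intended string wins by a wide margin.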

Upvotes: 0

StuffHappens

Reputation: 6557

There's a pretty clear implementation of this algorithm available on Google Code. Try it. However, I can't work out how to get the exact location (the beginning and ending points in the text) of a fuzzy match. If you have any idea how to get both points, please share.
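For the substitution-only case the two points are tied together: a match always has the pattern's length, so once the end is known the start is end - m + 1. Below is a minimal sketch, not the Google Code implementation, of a Bitap-style shift-and search with mismatches that keeps its state in an array of 32-bit words, so the pattern length is not bounded by the word size. The names `shift_in_one` and `bitap_mismatch_end` are made up for this illustration, and it assumes mismatches only.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: dst = (src << 1) | 1 over a multi-word bit vector.
 * Word 0 holds pattern positions 0..31, word 1 holds 32..63, and so on. */
static void shift_in_one(uint32_t *dst, const uint32_t *src, size_t words)
{
    uint32_t carry = 1;
    for (size_t w = 0; w < words; w++) {
        uint32_t top = src[w] >> 31;
        dst[w] = (src[w] << 1) | carry;
        carry = top;
    }
}

/* Hypothetical sketch: index of the LAST text character of the first match
 * with at most k mismatches, -1 if there is none, -2 if allocation fails.
 * The match then starts at end - strlen(pattern) + 1. */
long bitap_mismatch_end(const char *text, const char *pattern, size_t k)
{
    size_t n = strlen(text), m = strlen(pattern);
    if (m == 0 || m > n)
        return -1;
    size_t words = (m + 31) / 32;
    long end = -1;

    /* The row for byte c in masks has bit j set when pattern[j] == c. */
    uint32_t *masks = calloc(256 * words, sizeof *masks);
    /* Row d of state has bit j set when pattern[0..j] matches the text
     * ending at the current character with at most d mismatches. */
    uint32_t *state = calloc((k + 1) * words, sizeof *state);
    uint32_t *shifted = calloc(words, sizeof *shifted);
    uint32_t *below = calloc(words, sizeof *below);
    if (!masks || !state || !shifted || !below) {
        end = -2;
        goto done;
    }
    for (size_t j = 0; j < m; j++)
        masks[(unsigned char) pattern[j] * words + j / 32] |=
            UINT32_C(1) << (j % 32);

    for (size_t i = 0; i < n && end == -1; i++) {
        const uint32_t *mask = masks + (unsigned char) text[i] * words;
        for (size_t d = 0; d <= k; d++) {
            uint32_t *row = state + d * words;
            shift_in_one(shifted, row, words);
            for (size_t w = 0; w < words; w++) {
                /* Extend with a matching character, or spend one mismatch
                 * on top of the previous step's row d - 1. */
                row[w] = (shifted[w] & mask[w]) | (d ? below[w] : 0);
                below[w] = shifted[w];
            }
        }
        const uint32_t *last = state + k * words;
        if ((last[(m - 1) / 32] >> ((m - 1) % 32)) & 1)
            end = (long) i;
    }
done:
    free(masks);
    free(state);
    free(shifted);
    free(below);
    return end;
}
```

With the text "xxxadcxxx", the pattern "abc" and k = 1, the function returns 5, so the match spans positions 3 to 5. Once insertions and deletions are allowed the match length varies, so this shortcut no longer applies.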

Upvotes: 2
