balki
balki

Reputation: 27664

search for a string in a list of string ranges

I have a list of lexicographical ranges, for example [a,an) [an,bb) [bb,d) [c,h) Given a string say apple, I need to find which range it belongs to. In this case it is in the second range. If the string could belong to multiple ranges, the first one needs to be returned. Eg: cat should return range3 and not range4.

Brute force approach would be to loop through the list in order and check if the string fits in there. Better approach would be to resolve overlaps first, sort the ranges and do a binary search. Any suggestions for further optimized algorithm? Also implementation tips for c++ is welcome. This logic happens to occur on a critical execution path and has to be fast.

Update: Yes, there could be gaps in the ranges. Yes binary search can make it O(log(n)). Is there someway I can come up with a hash and make it even better? How would the hash look like? We can assume we have only lowercase characters in all the strings and ranges.

Upvotes: 2

Views: 682

Answers (7)

wildplasser
wildplasser

Reputation: 44250

My approach would be

  • a range has two limits (the lower and upper limit)
  • each range partitions the space into three parts (below, inside, above)
  • each limit partitions the space into two parts (below, above_or_equal)

So the method could be:

  • number the ranges
  • decompose the ranges into two limits
  • put the limits into a tree, in nodes containing two lists with the ranges that refer to them (one list for nodes that use this limit as lower limit, one for upper limit)
    • these lists can be bitmaps, since the ranges are numbered
  • to find a string
    • you walk the tree, and every time you step down you actually cross a limit and gain knowledge about which limits you have to your right/left, and which ranges you are left/right/inside.
    • you need two additional lists (of range numbers) to do this traversal.
    • these lists can be bitmaps
    • every time you cross a border you add the range number from one of the lists and remove it from the other.
  • once you are inside a range (x >= lower limit && x < upper limit; with the limits corresponding to the same range of course) the algorihtm finishes.
    • (given that this is actually the range with the lowest number: first match)
    • this can be detected if the two lists share one or more members
    • we want the lowest-numbered overlapping member.

Since this method is a tree search, it has O(log(N)) complexity.

UPDATE: On second thought, bitmaps are not good way to store the usage lists or the results. A linked list (actually two) is better. Code is 300 lines. Should I post it here ?

#include <limits.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

#define COUNTOF(a) (sizeof a / sizeof a[0])
#define WANT_DUPLICATES 0

struct range {
                /* separate linked lists for {lo,hi}) 
                ** for limits that this range participates in. */
        struct range *next_lo;
        struct range *next_hi;
        unsigned num;
        char *str_lo, *str_hi;
        } ranges[] =
{{NULL,NULL,0, "a","an"}, {NULL,NULL,1, "an", "bb"}
,{NULL,NULL,2, "bb", "d"}, {NULL,NULL,3, "c", "h"}
,{NULL,NULL,4, "bb", "c"}, {NULL,NULL,5, "c", "d"}
};
#define NRANGE COUNTOF(ranges)

void add_range(struct range *pp);
void list_dmp(FILE *fp, int isupper, struct range *bp);

struct treetwo {
        struct treetwo *prev;
        struct treetwo *next;
        char *str;
        struct range *list_lo; /* ranges that use str as lower limit */
        struct range *list_hi; /* ranges that use str as upper limit */
        };

struct treetwo *root = NULL;
struct treetwo ** find_hnd(struct treetwo **tpp, char *str);
void tree_dmp(FILE *fp, struct treetwo *tp, char *msg, unsigned indent);

struct result {
        unsigned size;
        unsigned used;
        struct {
                unsigned num;
                unsigned mask;
                } *entries;
        };
#define MASK_BELOW_LO 1
#define MASK_ABOVE_LO 2
#define MASK_BELOW_HI 4
#define MASK_ABOVE_HI 8

int result_resize(struct result *res, unsigned newsize);
void init_structures(void);
struct result *find_matches (char *str);
unsigned update_state(struct result *rp, struct treetwo *tp, int isabove);

int main (void)
{
char buff[100];
struct result *res;
size_t pos;
unsigned idx;
static char *legend[4] = { "unknown", "below", "above",  "impossible"};

init_structures();
tree_dmp(stderr, root, "Root", 0);

while (fgets (buff, sizeof buff, stdin) ) {
        pos=strcspn(buff, "\r\n");
        buff[pos] = 0;
        res = find_matches (buff);
        for (idx=0; idx < res->used; idx++) {
                unsigned num = res->entries[idx].num;
                unsigned mask = res->entries[idx].mask;
                fprintf(stdout, "[%u]Range%u %x: '%s' %s '%s' and '%s' %s '%s'\n"
                        , idx, num, mask
                        , buff, legend[mask & 3], ranges[num].str_lo
                        , buff, legend[(mask>>2) & 3], ranges[num].str_hi
                        );
                }
        }

return 0;
}

unsigned update_state(struct result *rp, struct treetwo *tp, int isabove)
{
struct range *p;
unsigned mask_lo, mask_hi;
unsigned hitcnt,idx;
/* State: (lower limit)
** 0 : unknown
** MASK_BELOW_LO: below limit
** MASK_ABOVE_LO: above limit
** 3: impossible
** State: (upper limit)
** 0 : unknown
** MASK_BELOW_HI: below limit
** MASK_ABOVE_HI: above limit
** c: impossible
** Combined states:
** required state 2|4 := 6
** 5: unreachable
** a: unreachable
** 9: impossible
** f: impossible
*/

if (!tp) return 0;
hitcnt=0;
mask_lo = (isabove>=0) ? MASK_ABOVE_LO : MASK_BELOW_LO;
mask_hi = (isabove>=0) ? MASK_ABOVE_HI : MASK_BELOW_HI;

fprintf(stderr , "Update_state(start{%s}, isabove=%d, mask=%x,%x)\n"
        , tp->str , isabove, mask_lo, mask_hi);
fprintf(stderr , "Update_state(Lo=%s)=", tp->str);
list_dmp(stderr , 0, tp->list_lo);
idx=0;
for (p = tp->list_lo; p ; p = p->next_lo) {
        unsigned num = p->num;
        fprintf(stderr , "Update_state:[%u] |= %u", num, mask_lo );
        for (   ;idx < rp->used;idx++) { if (rp->entries[idx].num >= num) break; }
        if ( idx < rp->used ) {
                fprintf(stderr , " Old was:%u\n", rp->entries[idx].mask );
                rp->entries[idx].mask |= mask_lo;
                if (rp->entries[idx].mask == (MASK_ABOVE_LO|MASK_BELOW_HI)) hitcnt++;
                continue;
                }
        if ( idx >= rp->used) {
                if ( rp->used >= rp->size && result_resize(rp, rp->size ? rp->size*2 : 8)) break;
                fprintf(stderr , " New at:%u\n", idx );
                rp->entries[idx].num = num;
                rp->entries[idx].mask = mask_lo;
                rp->used++;
                }
        }

fprintf(stderr , "Update_state(Hi=%s)=", tp->str);
list_dmp(stderr , 1, tp->list_hi);
idx=0;
for (p = tp->list_hi; p ; p = p->next_hi) {
        unsigned num = p->num;
        fprintf(stderr , "Update_state:[%u] |= %u", num, mask_lo );
        for (   ;idx < rp->used;idx++) { if (rp->entries[idx].num >= num) break; }
        if ( idx < rp->used ) {
                fprintf(stderr , " Old was:%u\n", rp->entries[idx].mask );
                rp->entries[idx].mask |= mask_hi;
                if (rp->entries[idx].mask == (MASK_ABOVE_LO|MASK_BELOW_HI)) hitcnt++;
                continue;
                }
        if ( idx >= rp->used) {
                if ( rp->used >= rp->size && result_resize(rp, rp->size ? rp->size*2 : 8)) break;
                fprintf(stderr , " New at:%u\n", idx );
                rp->entries[idx].num = num;
                rp->entries[idx].mask = mask_hi;
                rp->used++;
                }
        }
return hitcnt;
}

struct result *find_matches (char *str)
{
int rc;
struct treetwo **hnd;
struct result *res = malloc (sizeof *res);
unsigned dst,src;
res->used=res->size=0; res->entries=0;

for (hnd= &root; *hnd; hnd = (rc < 0) ? &(*hnd)->prev : &(*hnd)->next ) {
        rc = strcmp( str, (*hnd)->str);
        fprintf(stderr, "####\nStr=%s Node={%s} rc=%d\n"
                , str, (*hnd)->str, rc );
        list_dmp(stderr , 0, (*hnd)->list_lo );
        list_dmp(stderr , 1, (*hnd)->list_hi );
        rc = update_state(res, *hnd , rc);
#if WANT_DUPLICATES
        continue;
#else
        /* if we don't want duplicates we can bail out on the first match */
        if (rc) break;
#endif
        }


/* Now cleanup the results.
** Below(lower limit) and above(upper limit) and variations can be removed.
** Some results are incomplete, because one of there limits is out
** of reach (shadowed by a narrower range).  We'll have to recompute these.
** The result structure is compacted: if entries are deleted, the remaining ones are shifted down.
** Note: part of this cleanup (removal of unreacheables) could be done in update_state(),
** that would keep the array with partial results as short as possible.
*/
for (dst=src=0; src < res->used; src++) {
        int rc;
        unsigned num = res->entries[src].num;
rescan:
        switch (res->entries[src].mask & 0xf) {
        default: break;
        case 0: /* impossible */
                goto rescan;
#if WANT_DUPLICATES
        case MASK_ABOVE_LO:
                rc = strcmp(str, ranges[num].str_hi);
                res->entries[src].mask |= (rc >=0) ? MASK_ABOVE_HI : MASK_BELOW_HI;
                goto rescan;
        case MASK_BELOW_HI:
                rc = strcmp(str, ranges[num].str_lo);
                res->entries[src].mask |= (rc >=0) ? MASK_ABOVE_LO : MASK_BELOW_LO;
                goto rescan;
#endif
        case MASK_BELOW_HI|MASK_ABOVE_LO:
                if (dst != src) res->entries[dst] = res->entries[src];
                dst++;
                }
        }
fprintf(stderr, "####\nFinal pass: %u/%u\n", dst, res->used );
res->used = dst;
return res;
}

void init_structures(void)
{
unsigned idx;

for (idx = 0; idx < NRANGE; idx++) {
        add_range( &ranges[idx]);
        }
}

void list_dmp(FILE *fp, int isupper, struct range *bp)
{

fprintf(fp, "%s", (isupper) ? "Upper" :"Lower"  );
for (   ; bp  ; bp = (isupper) ? bp->next_hi : bp->next_lo) {
        fprintf(fp, " %u:{%s,%s}"
        , bp->num , bp->str_lo , bp->str_hi
        );
        }
fprintf( stdout, "\n" );
}

void add_range(struct range *pp)
{
struct treetwo **tpp;
struct range **rpp;

fprintf(stderr, "Inserting range %u->{%s,%s}\n", pp->num, pp->str_lo, pp->str_hi);
        /* find low boundary for this interval */
tpp = find_hnd (&root, pp->str_lo);
if (!*tpp) {
        fprintf(stderr, "Creating node for %u->%s (low)\n", pp->num, pp->str_lo);
        *tpp = malloc(sizeof **tpp);
        (*tpp)->list_lo = NULL;
        (*tpp)->list_hi = NULL;
        (*tpp)->str = pp->str_lo;
        }
for (rpp = &(*tpp)->list_lo; *rpp ; rpp = &(*rpp)->next_lo) {;}
*rpp = pp;
fprintf(stderr, "Added range %u->{%s,%s} to treenode(%s)->list_lo\n"
        , pp->num, pp->str_lo, pp->str_hi
        , (*tpp)->str
        );

        /* find high boundary */
tpp = find_hnd (&root, pp->str_hi);
if (!*tpp) {
        fprintf(stderr, "Creating node for %u->%s (High)\n", pp->num, pp->str_hi);
        *tpp = malloc(sizeof **tpp);
        (*tpp)->list_lo = NULL;
        (*tpp)->list_hi = NULL;
        (*tpp)->str = pp->str_hi;
        }
for (rpp = &(*tpp)->list_hi; *rpp ; rpp = &(*rpp)->next_hi) {;}
*rpp = pp;
fprintf(stderr, "Added range %u->{%s,%s} to treenode(%s)->list_hi\n"
        , pp->num, pp->str_lo, pp->str_hi
        , (*tpp)->str
        );
}

struct treetwo ** find_hnd(struct treetwo **tpp, char *str)
{
int rc;

for (   ; *tpp; tpp = (rc < 0) ? &(*tpp)->prev : &(*tpp)->next ) {
        rc = strcmp( str, (*tpp)->str);
        if (!rc) break;
        }
return tpp;
}

void tree_dmp(FILE *fp, struct treetwo *tp, char *msg, unsigned indent)
{
unsigned uu;
if (!tp) return;
if (!msg) msg = "";

for (uu=0; uu < indent; uu++) { fputc( ' ', fp); }
fprintf(fp, "%s:{%s}\n", msg, tp->str );

for (uu=0; uu < indent+1; uu++) { fputc( ' ', fp); }
list_dmp(fp , 0, tp->list_lo);

for (uu=0; uu < indent+1; uu++) { fputc( ' ', fp); }
list_dmp(fp , 1, tp->list_hi);

tree_dmp(fp, tp->prev, "Prev", indent+2);
tree_dmp(fp, tp->next, "Next", indent+2);
}
int result_resize(struct result *res, unsigned newsize)
{
void *old;

old = res->entries;
res->entries = realloc ( res->entries , newsize * sizeof *res->entries);
if ( !res->entries) {
        res->entries = old; return -1;
        }
res->size = newsize;
if (res->used > newsize) res->used = newsize;
return 0;
}

Upvotes: 0

user1196549
user1196549

Reputation:

I wouldn't be surprised that your set of ranges can be represented with a trie (http://en.wikipedia.org/wiki/Trie). Once the trie is filled, the query time should not exceed the length of the longest range bound nor the length of the query string.

This is optimal in terms of query time (in fact O(1) in your computational model).

Upvotes: 0

user1196549
user1196549

Reputation:

You might approach this by "gridding".

Use an array with 26 entries corresponding to the first letter. Every bin contains the list of ranges having a nonempty intersection with it. Like

'a' -> [a,an) [an,bb), 'b' -> [an,bb) [bb,d), 'c' -> [bb,d) [c,h) ...

You easily generalize the idea to a prefix of a few letters

'aaa' -> [a,an), 'aab' -> [a,an), 'aac' -> [a,an) ...

This can much shorten the list of ranges to be tried, especially if there are little overlaps, at the expense of storage and preprocessing time.

A special convention can be used to indicate that a bin is wholly covered.

Happy distributions can lead to O(1), I guess.

Upvotes: 0

AShelly
AShelly

Reputation: 35540

If you have memory to spare and are limited to lowercase, you can build a multi-way tree. Top node has an array of 26 pointers. Pointers are Null if no range starts with that character. They point to a range if all words starting with that character fall into the range, and point to another node if the ranges split on a following character. (so given [aa-ap],[ar-bl]; the 'a' entry would point to another node where entries 'a' through 'p' pointed to range 1, entry 'q' was null, and 'r' thru 'z' pointed to range 2. )

This should be O(max(range_specifier)).

Upvotes: 0

Ivaylo Strandjev
Ivaylo Strandjev

Reputation: 70939

Here is what you should do:

First sort the ranges with respect to their beginnings in lexicographical order. Then you should do the following pre-processing on them - for each range make it's beginning the greater of it's begining and the end of the previous range(if this makes the current range empty, simply ignore it). You do that because if a word is before the end of the previous range, then it will belong to some of the previous ranges and will never be classified in the current one. After this pre-processing all the ranges are non-overlapping and so each word you search for will belong to at most one of them. So all you need to do is to perform a binarry search on the resulting pre-processed ranges which will be in O(log(n)) complexity. I doubt you can achieve better complexity for this problem.

Upvotes: 2

High Performance Mark
High Performance Mark

Reputation: 78314

Some kind of index to the start of each range, perhaps a binary tree, would probably be a good idea. Not sure if you need to index to the end of each range, unless there may be gaps.

Upvotes: 1

Siddharth
Siddharth

Reputation: 9574

One solution comes to my mind, may be you can sort the word apple and identify the character that comes last in the a-z order. And just check for that one character in your ranges. Thinking more...

Upvotes: 0

Related Questions