bradley

Reputation: 776

Partial matching GAE search API

Using the GAE search API, is it possible to search for a partial match?

I'm trying to create autocomplete functionality where the search term would be a partial word, e.g.

> b
> bui
> build

would all return "building".

How is this possible with GAE?

Upvotes: 13

Views: 6931

Answers (6)

jar kir ang 1

Reputation: 55

Jumping in very late here.

But here is my well-documented tokenizing function. The docstring should help you understand and use it. Good luck!

def tokenize(string_to_tokenize, token_min_length=2):
  """Tokenizes a given string.

  Note: If a word in the string to tokenize is less then
  the minimum length of the token, then the word is added to the list
  of tokens and skipped from further processing.
  Avoids duplicate tokens by using a set to save the tokens.
  Example usage:
    tokens = tokenize('pack my box', 3)

  Args:
    string_to_tokenize: str, the string we need to tokenize.
    Example: 'pack my box'.
    min_length: int, the minimum length we want for a token.
    Example: 3.

  Returns:
    set, containng the tokenized strings. Example: set(['box', 'pac', 'my',
    'pack'])
  """
  tokens = set()
  token_min_length = token_min_length or 1
  for word in string_to_tokenize.split(' '):
    if len(word) <= token_min_length:
      tokens.add(word)
    else:
      for i in range(token_min_length, len(word) + 1):
        tokens.add(word[:i])
  return tokens
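As an illustration only (the index name, doc_id, and field name below are placeholders), here is a minimal sketch of feeding the resulting tokens into a Search API document so that any prefix of at least two characters matches:

from google.appengine.api import search

# Hypothetical example: index the word "building" so that prefixes such
# as "bui" or "build" all match it.
tokens = tokenize('building', token_min_length=2)
document = search.Document(
    doc_id='building-1',  # placeholder id
    fields=[search.TextField(name='name', value=' '.join(tokens))])
search.Index(name='autocomplete').put(document)

# A partial query such as "bui" now matches, because "bui" is one of the
# indexed tokens.
results = search.Index(name='autocomplete').search('name:bui')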

Upvotes: 0

Desmond Lua

Reputation: 6290

Although the LIKE statement (partial match) is not supported in Full Text Search, you can hack around it.

First, tokenize the data string into all possible substrings (hello = h, he, hel, lo, etc.):

def tokenize_autocomplete(phrase):
    """Returns every substring of every word in phrase (duplicates included)."""
    a = []
    for word in phrase.split():
        j = 1
        while True:
            # collect all substrings of the current length j
            for i in range(len(word) - j + 1):
                a.append(word[i:i + j])
            if j == len(word):
                break
            j += 1
    return a
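For example, a single word expands to every substring (duplicates included):

tokenize_autocomplete('hello')
# -> ['h', 'e', 'l', 'l', 'o', 'he', 'el', 'll', 'lo',
#     'hel', 'ell', 'llo', 'hell', 'ello', 'hello']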

Build an index + document (Search API) using the tokenized strings

index = search.Index(name='item_autocomplete')
for item in items:  # item = ndb.model
    name = ','.join(tokenize_autocomplete(item.name))
    document = search.Document(
        doc_id=item.key.urlsafe(),
        fields=[search.TextField(name='name', value=name)])
    index.put(document)

Perform the search, and voilà!

results = search.Index(name="item_autocomplete").search("name:elo")
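Since each document's doc_id was set to the entity's URL-safe key above, the matching ndb entities can be loaded back from the search results; a minimal sketch:

from google.appengine.ext import ndb

# Each returned ScoredDocument carries the doc_id we stored, i.e. the
# entity's URL-safe key, so the original items can be fetched directly.
keys = [ndb.Key(urlsafe=doc.doc_id) for doc in results]
items = ndb.get_multi(keys)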

https://code.luasoftware.com/tutorials/google-app-engine/partial-search-on-gae-with-search-api/

Upvotes: 31

jeissonp

Reputation: 331

My optimized version: it doesn't repeat tokens.

def tokenization(text):
    tokens = []
    min_len = 3  # renamed from `min` to avoid shadowing the built-in
    for word in text.split():
        # words of min_len characters or fewer are skipped entirely
        if len(word) > min_len:
            # go up to len(word) + 1 so the full word itself is indexed too
            for i in range(min_len, len(word) + 1):
                token = word[0:i]
                if token not in tokens:
                    tokens.append(token)
    return tokens
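For example (using the corrected version above), a two-word name expands to:

tokenization('hello world')
# -> ['hel', 'hell', 'hello', 'wor', 'worl', 'world']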

Upvotes: 0

Ahmad Muzakki

Reputation: 1088

Just like @Desmond Lua's answer, but with a different tokenize function:

def tokenize(phrase):
  token = []
  words = phrase.split(' ')
  for word in words:
    # build the running prefixes of each word, starting at length 2
    for i in range(1, len(word)):
      if i == 1:
        token.append(word[0] + word[i])
      else:
        token.append(token[-1] + word[i])

  return ",".join(token)

It will parse "hello world" as he,hel,hell,hello,wo,wor,worl,world.

It's good for lightweight autocomplete purposes.

Upvotes: 3

nguyên

Reputation: 5336

I had the same problem for a typeahead control, and my solution was to parse the string into small parts:

name = 'hello world'
name_search = ' '.join([name[:i] for i in xrange(2, len(name) + 1)])
print name_search
# -> he hel hell hello hello  hello w hello wo hello wor hello worl hello world
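The resulting string can then be indexed like any other Search API field; a rough sketch with placeholder index, doc_id, and field names:

from google.appengine.api import search

# Placeholder example: store the prefix string in a TextField so that a
# single-word prefix query such as "hel" matches one of its tokens.
document = search.Document(
    doc_id='item-1',  # placeholder id
    fields=[search.TextField(name='name_search', value=name_search)])
search.Index(name='typeahead').put(document)

results = search.Index(name='typeahead').search('name_search:hel')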

Hope this helps!

Upvotes: 0

Thanos Makris

Reputation: 3115

As described in Full Text Search and LIKE statement, no, it's not possible, since the Search API implements full-text indexing.

Hope this helps!

Upvotes: 2
