Shawn Aten

Reputation: 351

Why won't CloudSearch find substring matches in filename text field?

I have a CloudSearch domain with a filename text field. My issue is that a text query won't match (some) documents with filenames I think it (logically) should. If I have documents with these filenames:

  1. 'cars'
  2. 'Cars Movie.jpg'
  3. 'cars.pdf'
  4. 'cars#.jpg'

and I perform a simple text query of 'cars', I get back files #1, #2, and #4 but not #3. If I search 'cars*' (or do a structured query using prefix) I can match #3. This doesn't make sense to me, especially that #4 matches but #3 does not.

Upvotes: 0

Views: 1043

Answers (1)

Shawn Aten

Reputation: 351

TL;DR It's because of the way the tokenization algorithm handles periods.

When you perform a text search, you're performing a search against processed data, not the literal field. (Maybe that should've been obvious, but it wasn't how I was thinking about it before.)

The documentation gives an overview of how text is processed:

During indexing, Amazon CloudSearch processes text and text-array fields according to the analysis scheme configured for the field to determine what terms to add to the index. Before the analysis options are applied, the text is tokenized and normalized.

The part of the process that's ultimately causing this behavior is the tokenization:

During tokenization, the stream of text in a field is split into separate tokens on detectable boundaries using the word break rules defined in the Unicode Text Segmentation algorithm.

According to the word break rules, strings separated by whitespace such as spaces and tabs are treated as separate tokens. In many cases, punctuation is dropped and treated as whitespace. For example, strings are split at hyphens (-) and the at symbol (@). However, periods that are not followed by whitespace are considered part of the token.

The reason I was seeing the matches described in the question is that file extensions are being included with whatever precedes them as a single token. If we look back at the example and build an index according to these rules, it makes sense why a search of 'cars' returns documents #1, #2, and #4 but not #3.

#    Text                Index

1    'cars'              ['cars']
2    'Cars Movie.jpg'    ['cars', 'movie.jpg']
3    'cars.pdf'          ['cars.pdf']
4    'cars#.jpg'         ['cars', '.jpg']
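The table above can be sketched in code. This is only a rough approximation of the behavior I observed, not CloudSearch's actual UAX #29 implementation: periods stay inside a token unless they're followed by whitespace (or end the string), while other punctuation and whitespace act as token boundaries.

```python
import re

def tokenize(text):
    """Approximate CloudSearch-style tokenization (an assumption, not
    the real implementation): periods survive inside a token unless
    followed by whitespace; other punctuation splits tokens."""
    # Treat a period followed by whitespace (or at end of string) as a boundary.
    text = re.sub(r"\.(?=\s|$)", " ", text.lower())
    # Any run of word characters and embedded periods forms one token.
    return re.findall(r"[\w.]+", text)

for name in ["cars", "Cars Movie.jpg", "cars.pdf", "cars#.jpg"]:
    print(name, "->", tokenize(name))
# cars            -> ['cars']
# Cars Movie.jpg  -> ['cars', 'movie.jpg']
# cars.pdf        -> ['cars.pdf']
# cars#.jpg       -> ['cars', '.jpg']
```

A simple text query for 'cars' matches a document only if 'cars' appears as one of its tokens, which is why 'cars.pdf' (indexed as the single token 'cars.pdf') never matches without a wildcard or prefix query.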

Possible Solutions

It might seem like setting a custom analysis scheme could fix this, but none of the options there (stopwords, stemming, synonyms) help you overcome the tokenization behavior. I think the only way to get the desired matching is to tokenize the filename yourself (using a custom algorithm) before upload, and then store the tokens in a text-array field. Devising a custom tokenization algorithm that supports multiple languages is a large problem in itself, though.
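As a minimal sketch of that pre-upload approach, assuming filenames in languages where splitting on non-word characters is acceptable (the hard multilingual cases are exactly what this doesn't solve), you could split the filename on every non-word character so the stem and extension become separate tokens for the text-array field:

```python
import re

def filename_tokens(filename):
    """Hypothetical pre-upload tokenizer: split on any non-word
    character (including periods), so 'cars.pdf' yields both 'cars'
    and 'pdf' as tokens for a text-array field."""
    return [t for t in re.split(r"\W+", filename.lower()) if t]

print(filename_tokens("cars.pdf"))       # ['cars', 'pdf']
print(filename_tokens("Cars Movie.jpg")) # ['cars', 'movie', 'jpg']
```

With documents uploaded this way, a plain text query of 'cars' would match all four example filenames, since each document's token list contains 'cars'.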

Upvotes: 2
