Reputation: 351
I have a CloudSearch domain with a filename
text field. My issue is that a text query won't match some documents whose filenames I think it logically should. If I have documents with these filenames:

1. 'cars'
2. 'Cars Movie.jpg'
3. 'cars.pdf'
4. 'cars#.jpg'

and I perform a simple text query of 'cars', I get back files #1, #2, and #4, but not #3. If I search 'cars*' (or do a structured query using prefix), I can match #3. This doesn't make sense to me, especially since #4 matches but #3 does not.
Upvotes: 0
Views: 1043
Reputation: 351
TL;DR: It's because of the way the tokenization algorithm handles periods.
When you perform a text search, you're performing a search against processed data, not the literal field. (Maybe that should've been obvious, but it wasn't how I was thinking about it before.)
The documentation gives an overview of how text is processed:
During indexing, Amazon CloudSearch processes text and text-array fields according to the analysis scheme configured for the field to determine what terms to add to the index. Before the analysis options are applied, the text is tokenized and normalized.
The part of the process that's ultimately causing this behavior is the tokenization:
During tokenization, the stream of text in a field is split into separate tokens on detectable boundaries using the word break rules defined in the Unicode Text Segmentation algorithm.
According to the word break rules, strings separated by whitespace such as spaces and tabs are treated as separate tokens. In many cases, punctuation is dropped and treated as whitespace. For example, strings are split at hyphens (-) and the at symbol (@). However, periods that are not followed by whitespace are considered part of the token.
The reason I was seeing the matches described in the question is that the file extensions are being included with whatever precedes them as a single token. If we look back at the example and build an index according to these rules, it becomes clear why a search of 'cars' returns documents #1, #2, and #4 but not #3.
#   Text               Index
1   'cars'             ['cars']
2   'Cars Movie.jpg'   ['cars', 'movie.jpg']
3   'cars.pdf'         ['cars.pdf']
4   'cars#.jpg'        ['cars', '.jpg']
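The behavior can be sketched with a rough approximation of those word-break rules. This is not the real Unicode Text Segmentation implementation that CloudSearch uses; it's a minimal stand-in that only handles the punctuation from the example (hyphen, at sign, hash) plus the period rule, just to show how the index above comes about:

```python
import re

def tokenize(text):
    """Very rough sketch of the tokenization described above (NOT the real
    CloudSearch tokenizer): break-punctuation becomes whitespace, but a
    period is kept as part of a token unless whitespace (or end of string)
    follows it."""
    text = text.lower()
    # Punctuation like -, @, and # is dropped and treated as whitespace.
    text = re.sub(r"[-@#]", " ", text)
    # Periods followed by whitespace (or at the end) are also dropped...
    text = re.sub(r"\.(?=\s|$)", " ", text)
    # ...but periods followed by non-whitespace survive inside tokens.
    return text.split()

for name in ["cars", "Cars Movie.jpg", "cars.pdf", "cars#.jpg"]:
    print(name, "->", tokenize(name))
# cars -> ['cars']
# Cars Movie.jpg -> ['cars', 'movie.jpg']
# cars.pdf -> ['cars.pdf']
# cars#.jpg -> ['cars', '.jpg']
```

Because 'cars.pdf' survives as a single token, a text search for the term 'cars' finds no matching token for document #3, while 'cars#.jpg' splits at the hash and does produce a 'cars' token.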
It might seem like setting a custom analysis scheme could fix this, but none of the options there (stopwords, stemming, synonyms) help you overcome the tokenization problem. I think the only possible solution to get the desired behavior is to tokenize the filename (using a custom algorithm) before upload, and then store the tokens in a text-array field. That said, devising a custom tokenization algorithm that supports multiple languages is a large problem in its own right.
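A minimal sketch of that pre-upload tokenization, assuming Latin alphanumeric filenames only (which is exactly the multi-language limitation mentioned above); the `filename_tokens` field name is hypothetical:

```python
import re

def pre_tokenize(filename):
    """Hypothetical pre-upload tokenizer: split a filename on any run of
    non-alphanumeric characters, including periods, so a plain text search
    for 'cars' can match a document named 'cars.pdf'."""
    return [t for t in re.split(r"[^0-9A-Za-z]+", filename.lower()) if t]

# Upload the tokens in a text-array field alongside the raw filename, e.g.:
# {"fields": {"filename": "cars.pdf", "filename_tokens": ["cars", "pdf"]}}
print(pre_tokenize("cars.pdf"))   # ['cars', 'pdf']
```

Searching against the text-array field then matches on the individual pieces, since each array element is tokenized on its own.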
Upvotes: 2