Is it reasonable to merge tokens when indexing a search engine?

Question

I'm working on a search engine using AWS CloudSearch (although I think my question is relevant to search engines in general). Let's say I have a document to be indexed that has a text field called Name with the value Somecompany. Currently, if I run a query such as Somecompany, then of course I get that document in the result.

However, if I run a query like Some company, then I don't get that same result. I have some basic understanding of search engines and an inverted index. I know that the reason I'm not getting the document in the result is because the search engine index is only mapping the document with the token Somecompany. There might be separate document mappings in the index for the token Some and for the token company, but regardless my document is not mapped to either of them (and I wouldn't expect it to be).
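
To illustrate the behavior described above, here is a toy inverted index in Python. It is only a sketch: the whitespace tokenizer, the match-any-token search, and the names tokenize, build_index and search are assumptions made for illustration, not a description of how CloudSearch works internally.

```python
from collections import defaultdict

def tokenize(text):
    # Assumption: lowercase and split on whitespace; real analyzers do much more.
    return text.lower().split()

def build_index(docs):
    # Map each token to the set of document ids that contain it.
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def search(index, query):
    # Return every document that matches at least one query token.
    results = set()
    for token in tokenize(query):
        results |= index.get(token, set())
    return results

docs = {1: "Somecompany", 2: "Some other business"}
index = build_index(docs)

print(search(index, "Somecompany"))   # {1}
print(search(index, "Some company"))  # {2}
```

Document 1 is only reachable through the single token somecompany, so the two-word query never touches it.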

Is it possible and/or practical to index the search engine in such a way as to make the Some company search query find the Somecompany document that I'm looking for?

I would also like the reverse situation to work. So, if the document is indexed using the text field Some company, then I'd like the query Somecompany to find that document.

There is a solution that I've been thinking about, but I suspect it might go against the principles of an inverted index and be very inefficient. If I index all my documents with an additional field that contains the Name value compressed (every character except letters and numbers removed), and pre-process every query by compressing the value in the same way, then it should work (all my queries would be prefix searches).

My concern with this solution is that the inverted index will be filled with unique tokens that only map to a single document. Is that a problem? Is there an alternative solution?
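
The compression idea can be sketched roughly as follows. This is a minimal illustration, not CloudSearch code: lowercasing is an added assumption (the description above only mentions removing non-alphanumeric characters), and compress, build_compressed_index and prefix_search are hypothetical helper names. A sorted list with binary search stands in for the engine's prefix lookup.

```python
import bisect

def compress(text):
    # Keep only letters and digits; lowercasing is an extra assumption.
    return "".join(ch for ch in text.lower() if ch.isalnum())

def build_compressed_index(docs):
    # Sorted (compressed_name, doc_id) pairs stand in for the extra indexed field.
    return sorted((compress(name), doc_id) for doc_id, name in docs.items())

def prefix_search(entries, query):
    # Compress the query the same way, then scan forward from the first candidate.
    key = compress(query)
    start = bisect.bisect_left(entries, (key,))
    hits = []
    for name, doc_id in entries[start:]:
        if not name.startswith(key):
            break
        hits.append(doc_id)
    return hits

docs = {1: "Somecompany", 2: "Some company", 3: "Other Corp."}
entries = build_compressed_index(docs)

print(prefix_search(entries, "Some company"))  # [1, 2]
print(prefix_search(entries, "Somecompany"))   # [1, 2]
print(prefix_search(entries, "some comp"))     # [1, 2]
```

Both spellings collapse to the same key, so either form of the query finds either form of the document.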

alexroussos · Accepted Answer

I am confident your proposed solution to compress queries will work fine from a search performance perspective, and you shouldn't worry at all about indexing unique terms. The main drawback I see is just in losing a lot of the benefits of a search engine, like stemming, stopwords and synonyms, but if you're dealing with company names that are essentially proper nouns then this isn't much of an issue. It does perhaps put a slightly larger burden on the user to spell their query correctly (since entering "sime company" would match one of the words in "some company", while "simecompany" does not match "somecompany" at all, etc.), but you can ameliorate this with fuzzy search (the ~ operator) and by using a suggester.

I would be wary, though, of letting the odd format of your dataset dictate compromises in how you use search. If it's feasible, you may want to consider breaking those names back into tokens. Breaking strings into dictionary words is fairly simple, but your dictionary would need to contain those company names to be truly effective. I'm loath to suggest a manual solution, but if you're dealing with only a couple thousand names, it may be the best option in the long run.
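
As a rough sketch of what breaking a string into dictionary words could look like, here is a small dynamic-programming splitter in Python. The function name segment and the tiny word set are hypothetical; a real dictionary would need to be much larger and, as noted above, would have to include the company names themselves.

```python
def segment(text, dictionary):
    """Split text into dictionary words, or return None if no split exists."""
    text = text.lower()
    n = len(text)
    # best[i] holds one segmentation of text[:i], or None if none was found.
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            word = text[start:end]
            if best[start] is not None and word in dictionary:
                best[end] = best[start] + [word]
                break
    return best[n]

words = {"some", "company"}

print(segment("Somecompany", words))  # ['some', 'company']
print(segment("Acmewidgets", words))  # None
```

The second call shows the limitation: a name made of words missing from the dictionary cannot be split, which is where manual curation would come in.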

