Reputation: 3811

Lucene Indexing/Query strategy for hyphenated words

Many words are written hyphenated or separated by whitespace but are often also used as a single word. For example, "Basket Ball" or "basket-ball" can also be written as "basketball".

Say I index a sentence such as "Hey dude, I played basket ball yesterday" and then query for basketball (without double quotes).

In this case, and in the reverse case (index "basketball" and query "basket ball"), I do not get any results. Is there a way to solve this problem, directly or indirectly?

Edit:
I gave the example just to demonstrate the problem. In my actual application I'll be indexing and searching IDs. If I index 011 12345,
I should be able to query it using 01112345.

Thanks in advance.

Upvotes: 2

Answers (2)

user326729

Reputation: 11

I am not a Lucene user, but here are my 2 cents: before indexing, you have to preprocess your data so that it looks the way you want to search it. Do you also want the sentence to appear in the results if someone searches for just "ball"? If yes, then you have to build two input sentences from the single one ("hey dude, I played basket ball yesterday" and "hey dude, I played basketball yesterday") and index both of them. Is this what you are looking for?
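
For the ID case in the question, this kind of preprocessing could be as simple as removing separators before the text reaches the index, and applying the same step to the query string. Below is a minimal sketch in plain Java with no Lucene dependency. The class name `IdNormalizer` and the choice of separators (whitespace and hyphens) are assumptions made for illustration, not something stated in the question.

```java
final class IdNormalizer {
    private IdNormalizer() {}

    /**
     * Removes whitespace and hyphens, so that "011 12345" and "011-12345"
     * both collapse to the same searchable form.
     */
    static String normalize(String rawId) {
        return rawId.replaceAll("[\\s\\-]+", "");
    }

    public static void main(String[] args) {
        // Apply the same normalization to the indexed value and to the query.
        String indexed = normalize("011 12345");
        String query = normalize("01112345");
        System.out.println(indexed.equals(query));
    }
}
```

The important part is that indexing and querying go through the same function. If only one side is normalized, the two forms still will not match.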

Upvotes: 0

Joel

Reputation: 30156

Hyphens are not the issue here. Assuming you are using something like the StandardTokenizer, which splits on characters such as hyphens, users searching for "basket ball" will match the original text "Basket-Ball" (and vice versa), so no problem there.

The issue is going between two-word and one-word equivalents, e.g. "basketball" and "basket ball". You basically need to handle synonyms (e.g. jacket/coat or, in your case, basketball/'basket ball').

You can overcome this by creating a list of equivalent words yourself, or using a dictionary like WordNet, and supplementing either the index or the search with the synonyms for each term. Solr has a SynonymFilter you can probably leverage (also see here).
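
As a rough illustration of the "supplement the search" option with a hand-built list, here is a sketch in plain Java with no Lucene dependency. The class `QueryExpander` and its methods are hypothetical names invented for this example. It only produces the list of variants for a term; turning them into an actual query is left to your application.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class QueryExpander {
    // Hand-maintained equivalence list, keyed by the lower-cased term.
    private final Map<String, List<String>> equivalents = new HashMap<>();

    /** Registers an alternative spelling for a term, e.g. "basketball" -> "basket ball". */
    void addEquivalent(String term, String alternative) {
        equivalents.computeIfAbsent(term.toLowerCase(), k -> new ArrayList<>()).add(alternative);
    }

    /** Returns the original term followed by any registered alternatives. */
    List<String> expand(String term) {
        List<String> variants = new ArrayList<>();
        variants.add(term);
        variants.addAll(equivalents.getOrDefault(term.toLowerCase(), Collections.<String>emptyList()));
        return variants;
    }
}
```

Each variant could then become its own optional clause in the final query (a two-word variant such as "basket ball" as a phrase), so a document matching any one of them is returned.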

EDIT:

Here's the code for a very basic synonym filter I wrote a while ago. The synonyms are not externalized, but you can easily add that yourself.

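// Imports the class needs. Logger and CharArrayMap are not identified in the original post:
// Logger.getLogger(Class) suggests log4j, and the CharArrayMap package depends on your Lucene/Solr version.
import java.io.IOException;
import java.util.Arrays;
import java.util.Stack;

import org.apache.lucene.analysis.Token;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;

public class SynonymFilter extends TokenFilter {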
    private static final Logger log = Logger.getLogger(SynonymFilter.class);

    private Stack<Token> synStack = new Stack<Token>();

    static CharArrayMap<String[]> synLookup = new CharArrayMap<String[]>(5, true);
    static {
        synLookup.put("basketball".toCharArray(), new String[]{"basket ball"});
        synLookup.put("trainer".toCharArray(), new String[]{"sneaker"});
        synLookup.put("burger".toCharArray(), new String[]{"hamburger"});
        synLookup.put("bike".toCharArray(), new String[]{"bicycle", "cycle"});
    }

    // TODO reverse map all the syns to each other e.g. sneaker to trainer

    protected SynonymFilter(TokenStream input) {
        super(input);
    }

    @Override
    public Token next(Token reusableToken) throws IOException {
        if (synStack.size() > 0)
            return synStack.pop();

        Token nextToken = input.next(reusableToken);
        if (nextToken != null) {
            addSynonyms(nextToken);
        }

        return nextToken;
    }

    private void addSynonyms(Token nextToken) {
        char[] word = Arrays.copyOf(nextToken.termBuffer(), nextToken.termLength());
        String[] synonyms = synLookup.get(word);
        if (synonyms != null) {
            for (String s : synonyms) {
                if (!equals(word, s)) {
                    char[] chars = s.toCharArray();
                    Token synToken = new Token(chars, 0, chars.length, nextToken.startOffset(),  nextToken.endOffset());
                    synToken.setPositionIncrement(0);
                    synStack.add(synToken);
                    log.info("Found synonym: " + s + " for: " + new String(nextToken.term()));
                }
            }
        }
    }

    // Convenience overload: compares the whole char array against the string.
    public static boolean equals(char[] word, String subString) {
        return equals(word, word.length, subString);
    }

    // Compares the first len chars of word with subString, walking from the last character backwards.
    public static boolean equals(char[] word, int len, String subString) {
        if (len != subString.length())
            return false;

        for (int i = 0; i < subString.length(); i++) {
            if (word[len - i - 1] != subString.charAt(subString.length() - i - 1))
                return false;
        }

        return true;
    }
}
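
The TODO in the code above (mapping all synonyms to each other, e.g. sneaker to trainer) could be handled by storing whole groups of equivalent words instead of one-directional entries. The following is only a sketch in plain Java, using ordinary `String` keys rather than the `CharArrayMap` from the filter. The class `SynonymGroups` and its methods are hypothetical names, not part of Lucene or Solr.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Map;
import java.util.Set;

final class SynonymGroups {
    // Every word points at the shared set holding its whole group.
    private final Map<String, Set<String>> lookup = new HashMap<>();

    /** Registers words that are all equivalent to each other. */
    void addGroup(String... words) {
        Set<String> group = new LinkedHashSet<>();
        for (String w : words) {
            // If a word already belongs to a group, merge that group in.
            Set<String> existing = lookup.get(w);
            if (existing != null) {
                group.addAll(existing);
            }
            group.add(w);
        }
        for (String w : group) {
            lookup.put(w, group);
        }
    }

    /** Returns every synonym of the word, excluding the word itself. */
    Set<String> synonymsOf(String word) {
        Set<String> group = lookup.get(word);
        if (group == null) {
            return Collections.emptySet();
        }
        Set<String> result = new LinkedHashSet<>(group);
        result.remove(word);
        return result;
    }
}
```

With this, `addGroup("trainer", "sneaker")` makes the lookup work in both directions, so the filter would no longer depend on which form happens to be the map key.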

Upvotes: 3
