Build a query with different kinds of terms for SolR

Question

I have a webapp that runs searches against SolR through a URL query.

The results are received as a Document object.

My query looks like q=Book:Harlan AND Book:Coben AND .., and it works fine.

String[] word = searchedWord.trim().split(" ");
for (int i = 0; i < word.length; i++) {
    if (!StringUtils.isEmpty(word[i])) {
        if (i > 0) {
            query.append("%20AND%20");
        } 
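        String utf_encoded = URLEncoder.encode(StringEscapeUtils.escapeJava(word[i]), "UTF-8");
        // Append the encoded word as a clause on the Book field.
        query.append("Book:").append(utf_encoded);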
    }
}

But I need to handle different kinds of search terms: when the search contains an exact phrase such as "Harlan Coben", this code splits it into the two words "Harlan" and "Coben".

For example, my webapp should be able to search:

Exact terms: "Harlan Coben"

Multiple terms: shakespeare harlan coben

Multiple mixed terms: shakespeare "harlan coben" coben or shakespear "harlan coben" or "harlan coben" coben

The URL used to call SolR is UTF-8 encoded so that special characters are replaced.

How should I proceed? With regular expressions, or is there another way?

------ EDIT --------

To be more specific, the search terms can contain characters such as "@(!ùéàç", Chinese or Russian characters, or any other (Unicode) characters from a specific language.

I need to match them and separate them to prepare the SolR query.

Example:

If the search term is: coben "Harlan Coben" s(554603)hakesdpeare Straße Привет

my regex should match and give me this result:

 coben
 "Harlan Coben"
 s(554603)hakesdpeare
 Straße
 Привет

Then I need to prefix each of them with AND Book: or just Book: to get a query like the one below:

q=Book:coben AND Book:"Harlan Coben" AND Book:s(554603)hakesdpeare AND Book:Straße AND Book:Привет

I tried ("[a-z]+(?:\s+[a-z]+)+"|[a-z]+)(?:\s+|$) from @fge (thanks for that), but it only matches [a-z]. I also tried \p{all}, but that didn't work.

Any ideas?

------ END EDIT --------

Thanks for your help!

malak · Accepted Answer

I finally found a regex that matches any characters (including Chinese and other languages) and gives me each word of the search.

For example, if the search is:

harlan coben "harlan coben"

Each match found will be:

harlan
coben
"harlan coben"

Here is the used code:

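// A token is either a quoted phrase or a run of non-space characters.
Pattern PATTERN = Pattern.compile("(?>\"[^\"]+\"+)|(?>[^ ]+)+");
Matcher match = PATTERN.matcher(motRecherche);
int iM = 0;

while (match.find()) {
    if (iM > 0) {
        query.append("%20AND%20");
    }

    // Escape SolR's special characters, then append the token
    // as a clause on the Book field from the question.
    String escaped = CommonUtils.escapeSolrQuery(match.group(0));
    query.append("Book:" + escaped);
    iM++;
}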
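
As a self-contained illustration of the matching step above, here is a minimal sketch that only collects the tokens into a list. The class name SearchTokenizerSketch and the method tokenize are hypothetical names chosen for this example; the pattern is the one from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, not part of SolR or of the original code.
final class SearchTokenizerSketch {

    // A quoted phrase, or else a run of non-space characters.
    private static final Pattern TOKEN =
            Pattern.compile("(?>\"[^\"]+\"+)|(?>[^ ]+)+");

    /** Splits a raw search string into words and quoted phrases. */
    static List<String> tokenize(String search) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(search);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints one token per line; the quoted phrase stays in one piece.
        for (String token : tokenize("shakespeare \"harlan coben\" coben")) {
            System.out.println(token);
        }
    }
}
```

Because the character classes are negated ([^"] and [^ ]), the pattern does not depend on any alphabet, which is why accented, Cyrillic or Chinese words are matched like any other. An unbalanced quote, as in "harlan coben without the closing quote, falls through to the second alternative and is split on the space.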

Another thing about SolR: some special characters need to be escaped: + - && || ! ( ) { } [ ] ^ " ~ * ? : \. SolR provides a client class called ClientUtils with a method escapeQueryChars(), which I adapted for my needs:

public static String escapeSolrQuery(String searchWord) {
    // A token that is entirely one quoted phrase keeps its enclosing quotes.
    boolean isPhrase = searchWord.matches("\"[^\"]+\"");

    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < searchWord.length(); i++) {
        char c = searchWord.charAt(i);

        // Prefix SolR's special characters with a backslash.
        if (c == '\\' || c == '+' || c == '-' || c == '!' || c == '(' || c == ')' || c == ':'
                || c == '^' || c == '[' || c == ']' || c == '{' || c == '}' || c == '~'
                || c == '*' || c == '?' || c == '|' || c == '&' || c == ';' || c == '/') {
            sb.append('\\');
        }

        // Escape a double quote unless the whole token is a quoted phrase.
        if (c == '"' && !isPhrase) {
            sb.append('\\');
        }
        sb.append(c);
    }
    return sb.toString();
}
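
To show how escaping, joining and URL encoding could fit together, here is a minimal sketch that takes already separated tokens and builds the value of the q parameter. The class SolrQueryAssemblySketch and its methods are hypothetical names for this example, the escaped character set is copied from the method above, and the Charset overload of URLEncoder.encode assumes Java 10 or later.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class, not part of SolR or of the original code.
final class SolrQueryAssemblySketch {

    // Characters prefixed with a backslash (same set as the method above).
    private static final String SPECIAL = "\\+-!():^[]{}~*?|&;/";

    /** Escapes one token; the enclosing quotes of a phrase are left intact. */
    static String escape(String token) {
        boolean isPhrase = token.matches("\"[^\"]+\"");
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < token.length(); i++) {
            char c = token.charAt(i);
            if (SPECIAL.indexOf(c) >= 0 || (c == '"' && !isPhrase)) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    /** Joins the escaped tokens as field:token clauses separated by AND. */
    static String buildQuery(String field, List<String> tokens) {
        List<String> clauses = new ArrayList<>();
        for (String token : tokens) {
            clauses.add(field + ":" + escape(token));
        }
        return String.join(" AND ", clauses);
    }

    /** URL-encodes the finished query for use as the q parameter value. */
    static String encode(String query) {
        return URLEncoder.encode(query, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String query = buildQuery("Book",
                Arrays.asList("coben", "\"Harlan Coben\"", "s(554603)hakesdpeare"));
        System.out.println(query);
        System.out.println("q=" + encode(query));
    }
}
```

Encoding the whole query once at the end keeps the escaping logic independent of the URL layer. Note that URLEncoder writes spaces as + rather than %20; both forms are decoded to a space in a query string.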

Now it works fine :)
