malak

Reputation: 27

Build a query with different kinds of terms for SolR

I have a webapp that runs searches against SolR through a URL query.

The results are received as a Document object.

My query looks like this: q=Book:Harlan AND Book:Coben AND ..., and it works fine.

String[] word = searchedWord.trim().split(" ");
for (int i = 0; i < word.length; i++) {
    if (!StringUtils.isEmpty(word[i])) {
        if (i > 0) {
            query.append("%20AND%20");
        }
        String utf_encoded = URLEncoder.encode(StringEscapeUtils.escapeJava(word[i]), "UTF-8");
        // Append the encoded term to the query, prefixed with the field name.
        query.append("Book:").append(utf_encoded);
    }
}

But I need to handle different kinds of search terms: when the search contains an exact phrase such as "Harlan Coben", this code splits it into two words, "Harlan" and "Coben".

For example, my webapp should be able to search:

Exact terms: "Harlan Coben"

Multiple terms: shakespeare harlan coben

Multiple mixed terms: shakespeare "harlan coben" coben or shakespeare "harlan coben" or "harlan coben" coben

The URL used to call SolR is encoded in UTF-8 to replace special characters.

How should I proceed? With regular expressions, or is there another way?

------ EDIT --------

To be more specific, the terms may contain any characters: symbols such as "@(!ùéàç", Chinese or Russian text, or any other (Unicode) characters from a given language.

I need to match them and separate them to prepare the SolR query.

Example:

If the search term is: coben "Harlan Coben" s(554603)hakesdpeare Straße Привет, my regex should match and give me this result:

 coben
 "Harlan Coben"
 s(554603)hakesdpeare
 Straße
 Привет

Then I need to prefix each of them with Book: and join them with AND to get a query like the one below:

q=Book:coben AND Book:"Harlan Coben" AND Book:s(554603)hakesdpeare AND Book:Straße AND Book:Привет

I tried ("[a-z]+(?:\s+[a-z]+)+"|[a-z]+)(?:\s+|$) from @fge (thanks for that), but it only matches [a-z]. I also tried \\p{all}, but that didn't work.

Any ideas?

------ END EDIT --------

Thanks for your help!

Upvotes: 0

Views: 348

Answers (2)

malak

Reputation: 27

I finally found a regex that matches any characters (including Chinese or other languages) and gives me each term of the search.

For example, if the search is:

harlan coben "harlan coben"

the matches found will be:

harlan
coben
"harlan coben"

Here is the code I used:

Pattern PATTERN = Pattern.compile("(?>\"[^\"]+\"+)|(?>[^ ]+)+");
Matcher match = PATTERN.matcher(motRecherche);
match.reset();
int iM = 0;

while(match.find()){
    if(iM > 0){
        query.append("%20AND%20");
    }

    String utf_encoded = CommonUtils.escapeSolrQuery(match.group(0));
    query.append(":"+utf_encoded);
    iM++;
}

One more thing about SolR: some special characters must be escaped: + - && || ! ( ) { } [ ] ^ " ~ * ? : \. SolR provides a client class called ClientUtils with a method escapeQueryChars(), which I adapted for my needs:

public static String escapeSolrQuery(String searchWord) {
    // A token that is entirely one quoted phrase keeps its surrounding quotes unescaped.
    boolean quotedPhrase = searchWord.matches("\"[^\"]+\"");

    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < searchWord.length(); i++) {
        char c = searchWord.charAt(i);

        if (c == '\\' || c == '+' || c == '-' || c == '!' || c == '(' || c == ')' || c == ':'
                || c == '^' || c == '[' || c == ']' || c == '{' || c == '}' || c == '~'
                || c == '*' || c == '?' || c == '|' || c == '&' || c == ';' || c == '/') {
            sb.append('\\');
        }

        if (c == '"' && !quotedPhrase) {
            sb.append('\\');
        }
        sb.append(c);
    }
    return sb.toString();
}

Now it works fine :)
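For anyone who wants to try the tokenizing step on its own, here is a minimal self-contained sketch. It is not the exact code above: the class and method names (SolrQuerySketch, tokenize, buildQuery) are made up for illustration, the pattern is a simplified one (a quoted phrase, or else a run of non-whitespace characters), and SolR escaping and URL encoding are deliberately left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
public final class SolrQuerySketch {

    // Either a double-quoted phrase, or a run of non-whitespace characters.
    // The quoted alternative comes first so that a phrase is kept whole.
    private static final Pattern TOKEN = Pattern.compile("\"[^\"]+\"|\\S+");

    // Splits the raw search string into terms and quoted phrases.
    public static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    // Prefixes every token with "<field>:" and joins them with " AND ".
    // No SolR escaping and no URL encoding is applied here.
    public static String buildQuery(String field, String input) {
        StringBuilder query = new StringBuilder();
        for (String token : tokenize(input)) {
            if (query.length() > 0) {
                query.append(" AND ");
            }
            query.append(field).append(':').append(token);
        }
        return query.toString();
    }

    public static void main(String[] args) {
        String search = "coben \"Harlan Coben\" s(554603)hakesdpeare Straße Привет";
        System.out.println(buildQuery("Book", search));
    }
}
```

Since \S matches any non-whitespace character, accented, Cyrillic and Chinese terms are picked up without any special Unicode flag. In a real query each unquoted token would still need to go through something like escapeSolrQuery before being appended.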

Upvotes: 0

fge

Reputation: 121710

You can use a regex, but it will be quite complicated; in this case you need an alternation. It is assumed here that your search terms contain only letters:

("[a-z]+(?:\s+[a-z]+)+"|[a-z]+)(?:\s+|$)

(note that the alternation order is important here!)

Example:

import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class Bar
{
    private static final Pattern PATTERN = Pattern
        .compile("(\"[a-z]+(?:\\s+[a-z]+)+\"|[a-z]+)(?:\\s+|$)",
            Pattern.CASE_INSENSITIVE);

    public static void main(final String... args)
        throws IOException
    {
        tryAndMatch("\"Harlan Coben\"");
        tryAndMatch("shakespeare harlan coben");
        tryAndMatch("shakespeare \"harlan coben\" coben");
    }

    private static void tryAndMatch(final String input)
    {
        final Matcher m = PATTERN.matcher(input);

        System.out.printf("INPUT: -->%s<--\n", input);

        while (m.find())
            System.out.printf("Term -->%s<--\n", m.group(1));

        System.out.println("END INPUT");
    }
}

Now, as to substitution into URLs, be aware that URLEncoder is not made to encode URL components; it is made to encode application/x-www-form-urlencoded data, in which a space becomes + and whose set of escaped characters differs from that of either a URI path or a fragment.
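To see the form-encoding behaviour for yourself, a small sketch along these lines should do. The class and method names (FormEncodingDemo, formEncode) are only illustrative.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical demo class, for illustration only.
public final class FormEncodingDemo {

    // Thin wrapper around URLEncoder that hides the checked exception.
    public static String formEncode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a charset every Java platform must support.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The space is turned into '+', not "%20".
        System.out.println(formEncode("Harlan Coben"));
        // ':' and '"' are percent-encoded, the space is still '+'.
        System.out.println(formEncode("Book:\"Harlan Coben\""));
    }
}
```

The first call gives Harlan+Coben and the second gives Book%3A%22Harlan+Coben%22, which is form encoding rather than encoding for a URI path or fragment.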

The most accurate solution would be to use URI templates. This allows you to write templates such as:

http://my.site/?q={query}

where query is any Unicode string, and the template will encode it for you (self-promotion: if you are interested, I have a library to do that).

The second option is to use Guava 15.0+, which has a set of escapers made specifically for URLs.
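If neither of those is available, a rough JDK-only alternative (not one of the two options above, just a sketch) is to let the multi-argument java.net.URI constructor quote the query part. The host my.site comes from the template example; the class and method names are made up.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper class, for illustration only.
public final class UriQuerySketch {

    // Builds http://my.site/?q=<solrQuery>, letting java.net.URI quote
    // the characters that are illegal in a URI query component.
    public static String searchUrl(String solrQuery) {
        try {
            URI uri = new URI("http", null, "my.site", -1, "/", "q=" + solrQuery, null);
            // toASCIIString() additionally percent-encodes non-ASCII
            // characters as UTF-8 bytes.
            return uri.toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(searchUrl("Book:\"Harlan Coben\" AND Book:Straße"));
    }
}
```

Note that this constructor leaves characters that are legal in a query, such as & and +, untouched, so a search term containing them still needs separate handling.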

Upvotes: 1
