Reputation: 27
I have a web app that runs searches against Solr through a URL query.
The results are received as a Document object.
My query looks like: q=Book:Harlan AND Book:Coben AND ..
and it works fine.
String[] word = searchedWord.trim().split(" ");
for (int i = 0; i < word.length; i++) {
    if (!StringUtils.isEmpty(word[i])) {
        if (i > 0) {
            query.append("%20AND%20");
        }
        String utf_encoded = URLEncoder.encode(StringEscapeUtils.escapeJava(word[i]), "UTF-8");
    }
}
But I need to handle different kinds of search terms, because when the search term is an exact phrase such as "Harlan Coben"
, this code splits it into two words: "Harlan
and Coben"
For example, my web app should be able to search:
Exact terms: "Harlan Coben"
Multiple terms: shakespeare harlan coben
Multiple mixed terms: shakespeare "harlan coben" coben
or shakespeare "harlan coben"
or "harlan coben" coben
The URL used to call Solr is encoded in UTF-8 to replace special characters.
How should I proceed? With regular expressions, or is there another way?
------ EDIT --------
To be more specific, the search terms can contain characters such as "@(!ùéàç", Chinese or Russian text, or any other (Unicode?) characters from a given language.
I need to match the terms and separate them to prepare the Solr query.
Example:
If the search term is: coben "Harlan Coben" s(554603)hakesdpeare Straße Привет
My regex should match and give me this result:
coben
"Harlan Coben"
s(554603)hakesdpeare
Straße
Привет
Then I need to join them, prefixing each one with AND Book:
or just Book:
to get a query like the one below:
q=Book:coben AND Book:"Harlan Coben" AND Book:s(554603)hakesdpeare AND Book:Straße AND Book:Привет
I tried ("[a-z]+(?:\s+[a-z]+)+"|[a-z]+)(?:\s+|$)
from @fge (thanks for that), but it only matches [a-z]. I tried it with \\p{all}
but that didn't work.
Any ideas?
------ END EDIT --------
Thanks for your help!
Upvotes: 0
Views: 348
Reputation: 27
I finally found a regex that matches any characters (including Chinese or other languages) and gives me each word of the search.
For example, if the search is:
harlan coben "harlan coben"
each match found will be:
harlan
coben
"harlan coben"
Here is the code I used:
Pattern PATTERN = Pattern.compile("(?>\"[^\"]+\"+)|(?>[^ ]+)+");
Matcher match = PATTERN.matcher(motRecherche);
match.reset();
int iM = 0;
while (match.find()) {
    if (iM > 0) {
        query.append("%20AND%20");
    }
    String utf_encoded = CommonUtils.escapeSolrQuery(match.group(0));
    query.append(":" + utf_encoded);
    iM++;
}
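For reference, here is a self-contained sketch of just the tokenizing step. It uses a simplified form of the same idea (a quoted phrase first, otherwise a run of non-space characters) rather than the exact expression above, and the class and method names are hypothetical. The non-ASCII words from the question are written as Unicode escapes only to keep the source file ASCII.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original code.
final class SearchTokenizer {
    // Alternation order matters: try a quoted phrase first,
    // then fall back to any run of non-space characters.
    private static final Pattern TOKEN = Pattern.compile("\"[^\"]+\"|[^ ]+");

    static List<String> tokenize(String search) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(search);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        // The example from the question; the last two words are the
        // German and Cyrillic ones, written as escapes.
        String input = "coben \"Harlan Coben\" s(554603)hakesdpeare "
                + "Stra\u00DFe \u041F\u0440\u0438\u0432\u0435\u0442";
        for (String token : tokenize(input)) {
            System.out.println(token);
        }
    }
}
```

Because the second alternative is "anything but a space", no Unicode property class is needed; an unbalanced quote simply ends up inside an ordinary token.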
Another thing about Solr: some special characters need to be escaped, namely + - && || ! ( ) { } [ ] ^ " ~ * ? : \ . Solr's client library provides a class called ClientUtils with a method escapeQueryChars(), which I adapted for my needs:
public static String escapeSolrQuery(String searchWord) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < searchWord.length(); i++) {
        char c = searchWord.charAt(i);
        if (c == '\\' || c == '+' || c == '-' || c == '!' || c == '(' || c == ')' || c == ':'
                || c == '^' || c == '[' || c == ']' || c == '{' || c == '}' || c == '~'
                || c == '*' || c == '?' || c == '|' || c == '&' || c == ';' || c == '/') {
            sb.append('\\');
        }
        if (c == '\"' && !searchWord.matches("\"[^\"]+\"")) {
            sb.append('\\');
        }
        sb.append(c);
    }
    return sb.toString();
}
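To show how the escaping and the "Book:... AND Book:..." concatenation fit together, here is a minimal sketch under some assumptions: the class and method names are made up, the escaping is a compact rewrite of the same rules (not the SolrJ implementation), and the result is the raw query string, which still has to be URL-encoded before being sent.

```java
import java.util.List;

// Hypothetical sketch; names are illustrative only.
final class SolrQuerySketch {
    // Characters that get a backslash in front of them.
    private static final String SPECIAL = "\\+-!():^[]{}~*?|&;/";

    // Backslash-escape special characters. Quotes are left alone only
    // when the whole term is a quoted phrase.
    static String escape(String term) {
        boolean phrase = term.matches("\"[^\"]+\"");
        StringBuilder sb = new StringBuilder();
        for (char c : term.toCharArray()) {
            if (SPECIAL.indexOf(c) >= 0 || (c == '"' && !phrase)) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Join the escaped terms as field:term AND field:term ...
    static String build(String field, List<String> terms) {
        StringBuilder q = new StringBuilder();
        for (String term : terms) {
            if (q.length() > 0) {
                q.append(" AND ");
            }
            q.append(field).append(':').append(escape(term));
        }
        return q.toString();
    }
}
```

Calling build("Book", ...) with the terms coben, "Harlan Coben" and s(554603)hakesdpeare escapes the parentheses in the last term and keeps the quotes of the phrase intact.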
Now it works fine :)
Upvotes: 0
Reputation: 121710
You can use a regex but it will be quite complicated; in this case you need an alternation. Here it is assumed that you only have letters in your search term:
("[a-z]+(?:\s+[a-z]+)+"|[a-z]+)(?:\s+|$)
(note that the alternation order is important here!)
Example:
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public final class Bar
{
    // Quoted phrase first, then a single word; each followed by whitespace or end of input
    private static final Pattern PATTERN = Pattern
        .compile("(\"[a-z]+(?:\\s+[a-z]+)+\"|[a-z]+)(?:\\s+|$)",
            Pattern.CASE_INSENSITIVE);

    public static void main(final String... args)
        throws IOException
    {
        tryAndMatch("\"Harlan Coben\"");
        tryAndMatch("shakespeare harlan coben");
        tryAndMatch("shakespeare \"harlan coben\" coben");
    }

    private static void tryAndMatch(final String input)
    {
        final Matcher m = PATTERN.matcher(input);
        System.out.printf("INPUT: -->%s<--\n", input);
        while (m.find())
            System.out.printf("Term -->%s<--\n", m.group(1));
        System.out.println("END INPUT");
    }
}
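If the terms can contain letters from other scripts, one possible variation (an assumption on my part, not something tested against Solr) is to swap [a-z] for \p{L}, which matches any Unicode letter. The class name below is hypothetical. Note that this is still letters only: a term containing digits or punctuation, such as s(554603)hakesdpeare, is not matched as a whole.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical variation of the pattern above.
final class UnicodeLetterTerms {
    // Same shape as before, with [a-z] replaced by \p{L} (any Unicode letter).
    private static final Pattern PATTERN = Pattern.compile(
            "(\"\\p{L}+(?:\\s+\\p{L}+)+\"|\\p{L}+)(?:\\s+|$)");

    static List<String> terms(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = PATTERN.matcher(input);
        while (m.find()) {
            out.add(m.group(1)); // group 1 is the term without trailing whitespace
        }
        return out;
    }
}
```

CASE_INSENSITIVE is no longer needed here, since \p{L} covers both upper and lower case.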
Now, as to substitution into URLs, be aware that URLEncoder
is not made to encode URL components; it is made to encode application/x-www-form-urlencoded
data, in which a space becomes +
and whose set of escaped characters is not the same as that of either a URI path or a fragment.
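A quick way to see the difference is to run a few strings through URLEncoder. This is only a small demonstration; the class and method names are made up.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical demo class showing form encoding, not URI component encoding.
final class FormEncodingDemo {
    static String formEncode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a required charset on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(formEncode("Harlan Coben")); // Harlan+Coben
        System.out.println(formEncode("a+b"));          // a%2Bb
        System.out.println(formEncode("Book:x"));       // Book%3Ax
    }
}
```

The space comes out as + rather than %20, and characters such as : are escaped even where a URI would allow them unescaped.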
The most accurate solution would be to use URI templates. This allows you to write templates such as:
http://my.site/?q={query}
where query
is any Unicode string and this will encode it for you (self promotion: if you are interested I have a library to do that).
The second option is to use Guava 15.0+, which has a set of escapers made especially for URLs.
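Neither of those is part of the JDK. As a rough stand-in that needs no extra library (this is a different technique, not a URI template and not Guava), the multi-argument java.net.URI constructor quotes characters that are illegal in a URI. The class and method names below are hypothetical, and the host is the one from the template example above.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical sketch using only java.net.URI.
final class UriQueryDemo {
    static String searchUrl(String query) {
        try {
            // scheme, authority, path, query, fragment
            URI uri = new URI("http", "my.site", "/", "q=" + query, null);
            // toASCIIString() also percent-encodes non-ASCII characters as UTF-8.
            return uri.toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(searchUrl("Book:\"Harlan Coben\""));
    }
}
```

One caveat: characters that are legal in a query, such as & and +, are left as they are, so a query that contains them literally still needs separate handling.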
Upvotes: 1