Reputation: 2039
In one of my fields in Elasticsearch I'm storing the URL of my documents (e.g. http://techcrunch.com/something-great).
When I don't escape the URL, the document is found correctly, but I get an EOF error on some URLs.
When I escape the URL with:
String escapedString = QueryParser.escape(e.getKey().getUrl());
the document is not found and I get zero hits.
So how should I query for a document by its URL? Here is a sample document:
{
  _index: "crawlbot",
  _type: "article",
  _id: "AVFaaFu4w49jUzVInKS5",
  _score: 1,
  _source: {
    job: {
      id: 65,
      name: "wikipedia_en",
      max_pages: 300000,
      crawl_depth: 0,
      processing_patterns: "-Category,-User,-Wikipedia:,-Topic,-Special:,-Talk:,-Portal:,-MOS",
      status: 0,
      days: 0,
      url: [
        "https://en.wikipedia.org"
      ],
      ajax: false,
      min_description: 0
    },
    article: {
      url: "https://en.wikipedia.org/w/index.php?action=history&feed=atom&title=Parliament_of_Romania",
      provider_url: "https://en.wikipedia.org",
      provider_name: "",
      provider_display: "en.wikipedia.org",
      favicon_url: "http://www.google.com/s2/u/0/favicons?domain=https://en.wikipedia.org",
      language: "en",
      metadata: {
        authors: []
      },
      entities: [],
      keywords: [],
      videos: [],
      unfilteredKeywords: [],
      published: "",
      published_long: 0
    }
  }
}
I would like to retrieve the document by its article.url.
This is the query:
SearchRequestBuilder requestBuilder = client.prepareSearch("crawlbot").setSearchType(SearchType.DFS_QUERY_THEN_FETCH);
BoolQueryBuilder queryBuilder = new BoolQueryBuilder();
String escapedString = QueryParser.escape(e.getKey().getUrl());
queryBuilder.must(QueryBuilders.queryStringQuery(escapedString).defaultField("article.url"));
queryBuilder.must(QueryBuilders.queryStringQuery(e.getKey().getJob().getId() + "").defaultField("job.id"));
Error if I don't escape:
Exception in thread "main" org.elasticsearch.action.search.SearchPhaseExecutionException: Failed to execute phase [query], all shards failed; shardFailures {[9_T8APppReyWKppSNZWmXw][crawlbot][0]: SearchParseException[[crawlbot][0]: from[-1],size[-1]: Parse Failure [Failed to parse source [{"query":{"bool":{"must":[{"query_string":{"query":"http://www.zeit.de/wirtschaft/2015-11/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4","default_field":"article.url"}},{"query_string":{"query":"70","default_field":"job.id"}}]}}}]]]; nested: QueryParsingException[[crawlbot] Failed to parse query [http://www.zeit.de/wirtschaft/2015-11/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4]]; nested: ParseException[Cannot parse 'http://www.zeit.de/wirtschaft/2015-11/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4': Lexical error at line 1, column 111. Encountered: <EOF> after : "/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4"]; nested: TokenMgrError[Lexical error at line 1, column 111. Encountered: <EOF> after : "/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4"]; }{[9_T8APppReyWKppSNZWmXw][crawlbot][1]: SearchParseException[[crawlbot][1]: from[-1],size[-1]: Parse Failure [Failed to parse source [{"query":{"bool":{"must":[{"query_string":{"query":"http://www.zeit.de/wirtschaft/2015-11/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4","default_field":"article.url"}},{"query_string":{"query":"70","default_field":"job.id"}}]}}}]]]; nested: QueryParsingException[[crawlbot] Failed to parse query [http://www.zeit.de/wirtschaft/2015-11/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4]]; nested: ParseException[Cannot parse 'http://www.zeit.de/wirtschaft/2015-11/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4': Lexical error at line 1, column 111. 
Encountered: <EOF> after : "/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4"]; nested: TokenMgrError[Lexical error at line 1, column 111. Encountered: <EOF> after : "/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4"]; }{[9_T8APppReyWKppSNZWmXw][crawlbot][2]: SearchParseException[[crawlbot][2]: from[-1],size[-1]: Parse Failure [Failed to parse source [{"query":{"bool":{"must":[{"query_string":{"query":"http://www.zeit.de/wirtschaft/2015-11/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4","default_field":"article.url"}},{"query_string":{"query":"70","default_field":"job.id"}}]}}}]]]; nested: QueryParsingException[[crawlbot] Failed to parse query [http://www.zeit.de/wirtschaft/2015-11/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4]]; nested: ParseException[Cannot parse 'http://www.zeit.de/wirtschaft/2015-11/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4': Lexical error at line 1, column 111. Encountered: <EOF> after : "/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4"]; nested: TokenMgrError[Lexical error at line 1, column 111. 
Encountered: <EOF> after : "/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4"]; }{[9_T8APppReyWKppSNZWmXw][crawlbot][3]: SearchParseException[[crawlbot][3]: from[-1],size[-1]: Parse Failure [Failed to parse source [{"query":{"bool":{"must":[{"query_string":{"query":"http://www.zeit.de/wirtschaft/2015-11/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4","default_field":"article.url"}},{"query_string":{"query":"70","default_field":"job.id"}}]}}}]]]; nested: QueryParsingException[[crawlbot] Failed to parse query [http://www.zeit.de/wirtschaft/2015-11/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4]]; nested: ParseException[Cannot parse 'http://www.zeit.de/wirtschaft/2015-11/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4': Lexical error at line 1, column 111. Encountered: <EOF> after : "/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4"]; nested: TokenMgrError[Lexical error at line 1, column 111. Encountered: <EOF> after : "/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4"]; }{[9_T8APppReyWKppSNZWmXw][crawlbot][4]: SearchParseException[[crawlbot][4]: from[-1],size[-1]: Parse Failure [Failed to parse source [{"query":{"bool":{"must":[{"query_string":{"query":"http://www.zeit.de/wirtschaft/2015-11/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4","default_field":"article.url"}},{"query_string":{"query":"70","default_field":"job.id"}}]}}}]]]; nested: QueryParsingException[[crawlbot] Failed to parse query [http://www.zeit.de/wirtschaft/2015-11/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4]]; nested: ParseException[Cannot parse 'http://www.zeit.de/wirtschaft/2015-11/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4': Lexical error at line 1, column 111. 
Encountered: <EOF> after : "/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4"]; nested: TokenMgrError[Lexical error at line 1, column 111. Encountered: <EOF> after : "/griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4"]; }
at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction.onFirstPhaseResult(TransportSearchTypeAction.java:237)
at org.elasticsearch.action.search.type.TransportSearchTypeAction$BaseAsyncAction$1.onFailure(TransportSearchTypeAction.java:183)
at org.elasticsearch.search.action.SearchServiceTransportAction$23.run(SearchServiceTransportAction.java:565)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Upvotes: 0
Views: 1850
Reputation: 217294
I suggest you change the mapping of your article.url field to:
"url": {
  "type": "string",
  "index": "not_analyzed"
}
Otherwise the field is analyzed, and the standard analyzer breaks the URL up into several tokens, which makes it very hard to query for the exact URL.
Then, instead of using a query_string query, you can use a term query in order to query your documents.
SearchRequestBuilder requestBuilder = client.prepareSearch("crawlbot").setSearchType(SearchType.DFS_QUERY_THEN_FETCH);
BoolQueryBuilder queryBuilder = new BoolQueryBuilder();
// use a term query instead of a query_string query
queryBuilder.must(QueryBuilders.termQuery("article.url", e.getKey().getUrl()));
...
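As an aside on the EOF error itself: a likely explanation is that the Lucene query syntax used by query_string treats / as a regular-expression delimiter, so a URL with an odd number of unescaped slashes leaves the last regex unterminated. That would match the error text, which stops after the final "/griechenland-..." segment. The sketch below only illustrates this parity idea in plain Java; the class and method names are made up for the example, and it is a simplified heuristic, not the real parser (other special characters can cause failures too).

```java
// Hypothetical helper: a rough check for the "unterminated regex" situation.
// This is a simplification for illustration, not Lucene's actual lexer.
final class SlashCheck {

    /** Counts '/' characters that are not preceded by a backslash escape. */
    static int countUnescapedSlashes(String query) {
        int count = 0;
        for (int i = 0; i < query.length(); i++) {
            char c = query.charAt(i);
            if (c == '\\') {
                i++;        // skip the character that the backslash escapes
                continue;
            }
            if (c == '/') {
                count++;
            }
        }
        return count;
    }

    /** An odd number of unescaped slashes means one regex is never closed. */
    static boolean likelyUnterminatedRegex(String query) {
        return countUnescapedSlashes(query) % 2 != 0;
    }

    public static void main(String[] args) {
        String failing = "http://www.zeit.de/wirtschaft/2015-11/"
                + "griechenland-reparationszahlung-bayern-rentner-jahresrueckblick-2?page=4";
        String other = "https://en.wikipedia.org/w/index.php"
                + "?action=history&feed=atom&title=Parliament_of_Romania";

        System.out.println(likelyUnterminatedRegex(failing)); // prints true  (5 slashes)
        System.out.println(likelyUnterminatedRegex(other));   // prints false (4 slashes)
    }
}
```

With a term query none of this applies, because the value is not run through the query parser at all.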
UPDATE
Following up on Evaldas' comment (kudos Evaldas!), the idea in the end is to create a custom analyzer so that the URL is also lowercased.
When creating your index, you can add a new analyzer in the settings and then use it as the analyzer of your article.url field:
PUT /crawlbot
{
  "settings": {
    "analysis": {
      "analyzer": {
        "url_analyzer": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": [ "lowercase" ]
        }
      }
    }
  },
  "mappings": {
    "article": {
      "properties": {
        "article": {
          "properties": {
            "url": {
              "type": "string",
              "analyzer": "url_analyzer"
            }
          }
        }
      }
    }
  }
}
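One consequence worth spelling out: a term query does not analyze its input, so once the stored URL is lowercased at index time, the value you pass to termQuery needs the same lowercasing (alternatively, a match query runs the field's analyzer on the input for you). The helper below is a hypothetical stand-in for that normalization, not part of Elasticsearch, and it assumes String.toLowerCase(Locale.ROOT) is a close enough approximation of the lowercase filter for typical URLs.

```java
import java.util.Locale;

// Hypothetical helper: mimics what a keyword tokenizer plus a lowercase
// filter would store for a URL, i.e. the whole string as one lowercased token.
final class UrlTerm {

    /** Returns the URL as a single lowercased term, with no splitting. */
    static String toIndexedTerm(String url) {
        // Locale.ROOT avoids locale-specific surprises (e.g. the Turkish dotless i).
        return url.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String raw = "HTTP://TechCrunch.com/Something-Great";
        // prints http://techcrunch.com/something-great
        System.out.println(toIndexedTerm(raw));
    }
}
```

In the query from the first part of this answer, that would mean passing the normalized value, e.g. QueryBuilders.termQuery("article.url", UrlTerm.toIndexedTerm(e.getKey().getUrl())).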
Upvotes: 2