Reputation: 2797
In my ElasticHQ mapping:
@timestamp date yyyy-MM-dd HH:mm:ssZZZ
...
date date yyyy-MM-dd HH:mm:ssZZZ
In the above I have two types of date field each with a mapping to the same format.
In the data:
"@timestamp": "2014-05-21 23:22:47UTC"
....
"date": "2014-05-22 05:08:09-0400",
As above, the date format does not map to what ES thinks I have my dates formatted as. I assume something hinky happened at index time (I wasn't around).
Also interesting: When using a filtered range query like the following, I get a Parsing Exception explaining that my date is too short:
GET _search
{
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"range": {
"date": {
"from": "2013-11-23 07:00:29",
"to": "2015-11-23 07:00:29",
"time_zone": "+04:00"
}
}
}
}
}
}
But searching with the following passes ES's error check, but returns no results, I assume because of the date formatting in the documents.
GET _search
{
"query": {
"filtered": {
"query": {
"match_all": {}
},
"filter": {
"range": {
"date": {
"from": "2013-11-23 07:00:29UTC",
"to": "2015-11-23 07:00:29UTC",
"time_zone": "+04:00"
}
}
}
}
}
}
My question is this: given the above, is there any way we can avoid having to Re-Index and change the mapping and continue to search the malformed data? WE have around 1TB of data in this particular cluster, and would like to keep it as is, for obvious reasons.
Also attempted was a query that adheres to what is in the Data:
"query": {
"range": {
"date": {
"gte": "2014-05-22 05:08:09-0400",
"to": "2015-05-22 05:08:09-0400"
}
}
}
Upvotes: 1
Views: 2679
Reputation: 217544
The dates you have in your documents actually do conform to the date format you have in your mapping, i.e. yyyy-MM-dd HH:mm:ssZZZ
In date format patterns, ZZZ
stands for an RFC 822 time zone (e.g. -04:00, +04:00, EST, UTC, GMT, ...) so the dates you have in your data do comply otherwise they wouldn't have been indexed in the first place.
However, the best practice is to always make sure dates are transformed to UTC (or any other time zone common to the whole document base that makes sense in your context) before indexing them so that you have a common basis to query on.
As for your query that triggers errors, 2013-11-23 07:00:29
doesn't comply with the date format since the time zone is missing at the end. As you've rightly discovered, adding UTC
at the end fixes the query parsing problem (i.e. the missing ZZZ
part), but you might still get no results.
Now to answer your question, you have two main tasks to do:
1TB is a lot of data to reindex for fixing one or two fields. I don't know how your documents look like, but it doesn't really matter. The way I would approach the problem would be to run a partial update on all documents, and for this, I see two different solutions, in both of which the idea is to just fix the @timestamp
and date
fields:
Given the amount of data you have, solution 2 seems more appropriate.
So... your adhoc script should first issue a scroll query to obtain a scroll id like this:
curl -XGET 'server:9200/your_index/_search?search_type=scan&scroll=1m' -d '{
"query": { "match_all": {}},
"size": 1000
}'
As a result, you'll get a scroll id that you can now use to iterate over all your data with
curl -XGET 'server:9200/_search/scroll?_source=date,@timestamp&scroll=1m' -d 'your_scroll_id'
You'll get 1000 hits (you can de/increase the size
parameter in the first query above depending on your mileage) that you can now iterate over.
For each hit you get, you'll only have your two date fields that you need to fix. Then you can transform your dates into the standard timezone of your choosing using a solution like this for instance.
Finally, you can send your 1000 updated partial documents in one bulk like this:
curl -XPOST server:9200/_bulk -d '
{ "update" : {"_id" : "1", "_type" : "your_type", "_index" : "your_index"} }
{ "doc" : {"date" : "2013-11-23 07:00:29Z", "@timestamp": "2013-11-23 07:00:29Z"} }
{ "update" : {"_id" : "2", "_type" : "your_type", "_index" : "your_index"} }
{ "doc" : {"date" : "2014-09-12 06:00:29Z", "@timestamp": "2014-09-12 06:00:29Z"} }
...
'
Rinse and repeat with the next iteration...
I hope this should give you some initial pointers to get started. Let us know if you have any questions.
Upvotes: 1