ndtreviv

Reputation: 3624

analyzed field vs doc_values: true field

We have an Elasticsearch cluster containing over half a billion documents, each with a url field that stores a URL.

The url field mapping currently has the settings:

{
    index: not_analyzed
    doc_values: true
    ...
}

We want our users to be able to search for URLs, or portions of URLs, without having to use wildcards. For example, take a URL with the path: /part1/user@site/part2/part3.ext

They should be able to bring back a matching document by searching:

The way I see it, we have two options:

  1. Implement an analysed version of this field (which can no longer have doc_values: true) and do match querying instead of wildcards. This would also require using a custom analyser to leverage the pattern tokeniser to make the extracted terms correct (the standard tokeniser would split user@site into user and site).
  2. Go through our database and for each document create a new field that is a list of URL parts. This field could have doc_values: true still so would be stored off-heap, and we could do term querying on exact field values instead of wildcards.
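To make option 1 concrete, here is a minimal sketch of what the custom analyser might look like, written as a Python dict that could be passed as index settings. The names url_path_tokenizer and url_path_analyzer and the "/" split pattern are assumptions for illustration only. The small helper merely approximates the pattern tokeniser locally so the resulting terms can be inspected; it is not Elasticsearch itself.

```python
import re

# Hypothetical index settings for option 1. The tokenizer/analyzer names
# and the "/" pattern are illustrative assumptions, not from the question.
URL_ANALYSIS_SETTINGS = {
    "analysis": {
        "tokenizer": {
            "url_path_tokenizer": {"type": "pattern", "pattern": "/"}
        },
        "analyzer": {
            "url_path_analyzer": {
                "type": "custom",
                "tokenizer": "url_path_tokenizer",
            }
        },
    }
}


def emulate_pattern_tokenizer(text, pattern="/"):
    """Approximate a pattern tokenizer: split on the regex, drop empty tokens."""
    return [token for token in re.split(pattern, text) if token]


print(emulate_pattern_tokenizer("/part1/user@site/part2/part3.ext"))
# ['part1', 'user@site', 'part2', 'part3.ext']
```

Splitting only on "/" keeps user@site as a single term, which is the behaviour the standard tokeniser would not give.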

My question is this:

Which is better for performance: a variable-length list field with doc_values on, or an analysed field (i.e. option 2 or option 1)? Or is there an option 3 that would be better still?

Thanks for your help!

Upvotes: 0

Views: 99

Answers (2)

ndtreviv

Reputation: 3624

It seems that no-one has actually performance tested the two options, so I did.

I took a sample of 10 million documents and created two new indices:

  1. An index with an analysed field that was set up as suggested in the other answer.
  2. An index with a string field that would store all permutations of URL segmentation.

I ran an enrichment process over the second index to populate the fields. The field values on the first index were created when I re-indexed the sample data from my main index.
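For readers wondering what the enrichment step might produce, here is a rough sketch. It assumes that "all permutations of URL segmentation" means every contiguous run of path segments, which is only one possible reading; the function name and the url_parts field name are hypothetical.

```python
def url_segment_permutations(path):
    """Return every contiguous run of path segments, joined with '/'.

    Assumption: "permutations" is read here as contiguous sub-paths.
    """
    segments = [segment for segment in path.split("/") if segment]
    values = []
    for start in range(len(segments)):
        for end in range(start + 1, len(segments) + 1):
            values.append("/".join(segments[start:end]))
    return values


def term_query(value, field="url_parts"):
    """Build an exact-match term query body for the hypothetical list field."""
    return {"query": {"term": {field: value}}}


parts = url_segment_permutations("/part1/user@site/part2/part3.ext")
print(len(parts))  # 10 values for a 4-segment path
print(term_query("user@site/part2"))
```

A path with n segments yields n(n+1)/2 values under this reading, which gives some intuition for why the extra field can bloat the index considerably.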

Then I created a set of Gatling tests to run against the indices and compared the Gatling results and the netdata (https://github.com/firehol/netdata) landscape for each.

The results were as follows:

[Figure: quantile comparisons of analysed vs not_analysed list]

Regarding the netdata landscape: the analysed field showed a spike, although only a small one, on all Elasticsearch nodes. The not_analysed list field tests didn't even register.

It is worth mentioning that enriching the list field with URL segmentation permutations bloated the index by about 80% in our case. So there's a trade-off: you never need to do wildcard searches for exact sub-segment matching on URLs, but you'll need a lot more disk to do it.

Update

Don't use the analysed field. Go for doc_values. Doing anything with analyzed strings that have a massive number of possible terms means massive fielddata that will eventually no longer fit in the memory you can allocate to it.

Upvotes: 1

Karsten R.

Reputation: 1758

Your question is about a field for which you need doc_values but which you cannot index with the keyword analyzer.

You did not mention why you need doc_values. But you did mention that you currently do not search in this field. So I guess that the name of the search field does not have to be the same: you can copy the field value into another field which is only for search ( "store": false ). For this new field you can use the pattern analyzer or pattern tokenizer for your use case.
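As a rough illustration of that idea, the mapping could look something like the Python dict below. The url_search field name and the url_path_analyzer analyzer name are assumptions; the analyzer would have to be defined in the index settings (for example as a custom analyzer built on the pattern tokenizer). The string type matches the not_analyzed style mapping shown in the question.

```python
# Hypothetical mapping: url keeps doc_values and is not analyzed, while its
# value is copied into a separate, search-only analyzed field.
URL_MAPPING = {
    "properties": {
        "url": {
            "type": "string",
            "index": "not_analyzed",
            "doc_values": True,
            "copy_to": "url_search",  # assumed name of the search-only field
        },
        "url_search": {
            "type": "string",
            "analyzer": "url_path_analyzer",  # assumed custom analyzer name
            "store": False,
        },
    }
}


def match_query(text, field="url_search"):
    """Build a match query body against the search-only field."""
    return {"query": {"match": {field: text}}}


print(match_query("user@site"))
```

Searches would then go against url_search with a match query, while anything that relies on doc_values keeps using the original url field.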

Upvotes: 1
