analyzed field vs doc_values: true field

Question

We have an Elasticsearch cluster containing over half a billion documents, each of which has a url field that stores a URL.

The url field mapping currently has the settings:

{
    "index": "not_analyzed",
    "doc_values": true,
    ...
}

We want our users to be able to search for URLs, or portions of URLs, without having to use wildcards. For example, take a URL with the path /part1/user@site/part2/part3.ext

They should be able to bring back a matching document by searching:

part3.ext
user@site
part1
part2/part3.ext

The way I see it, we have two options:

1. Implement an analysed version of this field (which can no longer have doc_values: true) and use match queries instead of wildcards. This would also require a custom analyser built on the pattern tokeniser so that the extracted terms are correct (the standard tokeniser would split user@site into user and site).
2. Go through our database and, for each document, create a new field that is a list of URL parts. This field could still have doc_values: true, so it would be stored off-heap, and we could run term queries on exact field values instead of wildcards.
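
For illustration, option 1 might be set up roughly as below. This is a minimal sketch, not a tested configuration: the names url_path_tokenizer, url_path, the analysed sub-field and the doc type are made up, and splitting on "/" alone is an assumption about how the URL should be broken up. The tokenize helper only imitates in plain Python what such a pattern tokeniser would be expected to emit.

```python
import re

# Hypothetical index body: a custom analyser built on the pattern tokeniser.
# Every name below (url_path_tokenizer, url_path, "analysed", "doc") is
# illustrative and not taken from a real cluster.
INDEX_BODY = {
    "settings": {
        "analysis": {
            "tokenizer": {
                # Split on "/" only, so "user@site" stays a single term.
                "url_path_tokenizer": {"type": "pattern", "pattern": "/"}
            },
            "analyzer": {
                "url_path": {"type": "custom", "tokenizer": "url_path_tokenizer"}
            },
        }
    },
    "mappings": {
        "doc": {
            "properties": {
                "url": {
                    # The existing exact-value field is kept as it is.
                    "type": "string",
                    "index": "not_analyzed",
                    "doc_values": True,
                    # The analysed variant lives in a sub-field.
                    "fields": {
                        "analysed": {"type": "string", "analyzer": "url_path"}
                    },
                }
            }
        }
    },
}


def tokenize(path):
    """Imitate the split-on-'/' tokeniser locally, dropping empty tokens."""
    return [token for token in re.split("/", path) if token]


print(tokenize("/part1/user@site/part2/part3.ext"))
```

With terms like these, a search for part2/part3.ext would presumably need a phrase-style match so that both terms have to appear next to each other.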

My question is this:

Which is better for performance: an analysed field, or a variable-length list field with doc_values on (i.e. option 1 or option 2)? Or is there an option 3 that would be better still?

Thanks for your help!

ndtreviv · Accepted Answer

It seems that no one has actually performance-tested the two options, so I did.

I took a sample of 10 million documents and created two new indices:

1. An index with an analysed field that was set up as suggested in the other answer.
2. An index with a string field that would store all permutations of URL segmentation.

I ran an enrichment process over the second index to populate the fields. The field values on the first index were created when I re-indexed the sample data from my main index.
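
The enrichment step for the second index can be pictured with something like the sketch below. It assumes that "all permutations of URL segmentation" means every contiguous run of path segments; url_parts is a hypothetical name used both for the helper and for the list field, and the real enrichment process may have differed.

```python
def url_parts(path):
    """Return every contiguous run of path segments, joined with '/'."""
    segments = [segment for segment in path.split("/") if segment]
    parts = []
    for start in range(len(segments)):
        for end in range(start + 1, len(segments) + 1):
            parts.append("/".join(segments[start:end]))
    return parts


parts = url_parts("/part1/user@site/part2/part3.ext")
print(parts)

# A term query against the (hypothetical) url_parts list field then only
# needs an exact value, with no wildcards involved.
query = {"query": {"term": {"url_parts": "part2/part3.ext"}}}
```

Each of the example searches from the question (part3.ext, user@site, part1, part2/part3.ext) is one of the generated values, so each can be served by an exact term lookup.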

Then I created a set of Gatling tests to run against the indices and compared the Gatling results and the netdata (https://github.com/firehol/netdata) landscape for each.

The results were as follows:

Regarding the netdata landscape: the analysed field showed a spike, although only a small one, on all Elasticsearch nodes. The not_analyzed list field tests didn't even register.

It is worth mentioning that enriching the list field with URL segmentation permutations bloated the index by about 80% in our case. So there's a trade-off: you never need to do wildcard searches for exact sub-segment matching on URLs, but you'll need a lot more disk space to do it.
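
To get a rough feel for where the extra disk goes: if the list holds every contiguous run of path segments (an assumption, since the exact scheme may differ), a path with n segments produces n(n+1)/2 list entries, so the number of stored values grows quadratically with path depth. The small sketch below only illustrates that count; it does not reproduce the 80% figure, which depends on the actual data.

```python
def subpath_count(segment_count):
    """Number of contiguous runs of segments in a path with that many segments."""
    return segment_count * (segment_count + 1) // 2


for n in (2, 4, 8, 16):
    print(n, "segments ->", subpath_count(n), "list entries")
```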

Update

Don't use the analysed field. Go for doc_values. Doing anything with analysed strings that have a massive number of possible terms will mean massive fielddata that will eventually no longer fit in the amount of memory you can allocate to it.
