snegi

Reputation: 626

Query about Elasticsearch

I am writing a service that will create and manage user records, 100+ million of them. For each new user, the service will generate a unique user id and write it to the database. The database is sharded on the generated unique user id.

Each user record has several fields. One of the requirements is that the service be able to check whether a user with a matching field value exists, so those fields are declared as indexes in the database schema.

However, since the database is sharded on the primary key (the unique user id), I will need to search all shards to find a user record that matches a particular column.

To make that lookup fast, one thing I am thinking of doing is setting up an Elasticsearch cluster. The service will write to the ES cluster every time it creates a new user record, and the ES cluster will index the user record on the relevant fields.

My questions are:

-- What kind of performance can I expect from ES here? Assume I have 100+ million user records, where 5 columns of each record need to be indexed. I know it depends on the hardware configuration as well, but please assume well-tuned hardware.

-- Here I am trying to use ES as a memcache alternative that supports multiple keys. So I want the whole dataset to be in memory, and it does not need to be durable. Is ES the right tool for that?

Any comments or recommendations based on experience with Elasticsearch for large datasets are very much appreciated.

Upvotes: 0

Views: 194

Answers (1)

Bruce Ritchie

Reputation: 1035

ES is not explicitly designed to run completely in memory. You normally wouldn't want to do that with large unbounded datasets in a Java application (though you can, using off-heap memory). Rather, it will cache what it can and rely on the OS's disk cache for the rest.

100+ million records shouldn't be an issue at all, even on a single machine. I run an index consisting of 15 million records of ~100 small fields (no large text fields), amounting to 65 GB of data on disk, on a single machine. Fairly complex queries that just return id/score execute in less than 500 ms; queries that require loading the documents return in 1-1.5 seconds on a warmed-up VM against a single SSD. I tend to give the JVM 12-16 GB of memory; any more and I find it's better to scale out via a cluster than to run a single huge VM.
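For the "does a user with this field value exist" check, you only need the cheap kind of query mentioned above (ids only, no document loading). Here is a rough sketch of what the request bodies could look like, assuming a reasonably recent ES version (typeless mappings and the `keyword` field type). The field names are made up, since the question doesn't name the five columns, and this only builds the JSON; sending it to your cluster is left to whichever client you use.

```python
import json

# Hypothetical field names; the question does not say which five columns are indexed.
INDEXED_FIELDS = ["email", "phone", "username", "device_id", "external_ref"]


def build_mapping(fields):
    """Index-creation body: each lookup field is mapped as an exact-match `keyword`."""
    return {
        "mappings": {
            "properties": {name: {"type": "keyword"} for name in fields}
        }
    }


def build_exists_query(field, value):
    """Search body that checks whether any user has `field == value`.

    The term clause sits in filter context (no scoring), and `_source` is
    disabled so only the hit's id comes back, not the stored document.
    """
    return {
        "size": 1,
        "_source": False,
        "query": {"bool": {"filter": [{"term": {field: value}}]}},
    }


if __name__ == "__main__":
    print(json.dumps(build_mapping(INDEXED_FIELDS), indent=2))
    print(json.dumps(build_exists_query("email", "a@example.com"), indent=2))
```

If the hit count in the response is non-zero, a matching user exists, and the returned `_id` can be your sharding key so the service can go straight to the right database shard.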

Upvotes: 1
