user2022068

Reputation:

Elasticsearch: Working with frequently updated documents

I have a forum application. There are multiple topics (posts) in the forums. Every topic has fields such as viewCount (how many times the topic was viewed by forum users).

I want all fields of a topic (id, date, title, content and viewCount) to be read from ES. However, in this case, ES must reindex the entire document after every topic view.
I asked a question about partial updates on Stack Overflow: "Partial update on field that is not indexed". It's important to note that the viewCount field is not indexed; it's just stored in ES.

There are two terms: partial update and partial reindex. ES supports partial update, where you change only a few fields. But there is no partial reindex, which means that even if you change only one field, ES will reindex the entire document. So if a topic is viewed 1000 times, ES will index it 1000 times. And if I have a lot of users, many documents will be indexed again and again. This is the first strategy.

The second strategy is to keep some fields of the topic in the index and some in the database. In this case, I can take viewCount from the DB. Alternatively, I can store all fields in the DB and use the index only as an index, i.e. to get the ids of topics.

What is the best way to solve such problems?

Upvotes: 15

Views: 16380

Answers (3)

Slam

Reputation: 8572

It seems to me that if you are using ES, you should just update all the data in the index and query against it. If you split the text (as far as I understand, you store topics in ES for text search) and the "digital" data between datastores, you'll take a bigger performance hit than you would from reindexing docs in ES.

The only things ES can do with documents in an index are indexing and deleting them. So there are two ways to speed up reindexing:

  • speed up the "payload" - reduce the time taken to remove a document and index it again. This can be achieved by moving the ES index to memory, to leverage Lucene's RamIndexStore

  • reduce network overhead - perform operations at ES side with scripts
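To illustrate the second point, here is a minimal sketch of a scripted update that increments the counter on the ES side, so the client never fetches the document. It only builds the request and leaves sending it to whatever HTTP client you use. The index name `topics`, the helper name `build_view_increment`, and the endpoint shape (`POST /<index>/_update/<id>` with a Painless script, as in recent ES versions) are assumptions; older versions use a different URL layout and script language.

```python
import json


def build_view_increment(index: str, topic_id: str, by: int = 1):
    """Return (method, path, body) for a scripted update that bumps viewCount.

    Hypothetical helper: it only describes the request, it does not send it.
    """
    path = f"/{index}/_update/{topic_id}"
    body = {
        "script": {
            "lang": "painless",
            # The script runs on the shard that holds the document.
            "source": "ctx._source.viewCount += params.by",
            "params": {"by": by},
        }
    }
    return "POST", path, body


if __name__ == "__main__":
    method, path, body = build_view_increment("topics", "42")
    print(method, path)
    print(json.dumps(body, indent=2))
```

Note that this only saves the round trip; ES still reindexes the whole document internally.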

By the way, are you already experiencing performance issues?

Upvotes: 8

Grokify

Reputation: 16334

Regarding Partial Updates to Documents, it is important to recognize that while the API lets you perform a partial update, behind the scenes it performs a full update by retrieving the document, changing it and reindexing it. The following is from the Elasticsearch website:

Partial Updates to Documents

In Updating a Whole Document, we said that the way to update a document is to retrieve it, change it, and then reindex the whole document. This is true. However, using the update API, we can make partial updates like incrementing a counter in a single request.

We also said that documents are immutable: they cannot be changed, only replaced. The update API must obey the same rules. Externally, it appears as though we are partially updating a document in place. Internally, however, the update API simply manages the same retrieve-change-reindex process that we have already described. The difference is that this process happens within a shard, thus avoiding the network overhead of multiple requests. By reducing the time between the retrieve and reindex steps, we also reduce the likelihood of there being conflicting changes from other processes.

To both store the full-text data in Elasticsearch and have fields that change often without reindexing the entire document, you will need to store those fields elsewhere. This can be a metadata / counter store within another Elasticsearch index or another system.

For common use cases, you could run the same query against both and merge the results. These are most likely simple filters and sorts on fields that don't change, e.g. subject, creation time, author, etc.

For searches that won't match, such as full-text queries, you can either (a) not display that data, or (b) use an eventually consistent approach where you periodically update the Elasticsearch topic store with the updated counts. Many systems that don't have high consistency requirements can use the eventually consistent approach, including Stack Overflow, Netflix, etc. For example, on some sites, you'll get one count on one page / widget and another count on another page / widget due to the eventually consistent design.
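As a rough sketch of option (b), views can be accumulated in memory and flushed periodically as a single bulk request, so a topic viewed 1000 times between flushes is reindexed once rather than 1000 times. The class name `ViewCountBuffer` and the index name `topics` are made up for illustration; the output is the newline-delimited JSON body expected by the `_bulk` endpoint, and scheduling the flush and sending the request are left out.

```python
import json
from collections import Counter


class ViewCountBuffer:
    """Accumulates view increments and turns them into one _bulk body."""

    def __init__(self, index: str = "topics"):
        self.index = index
        self.pending = Counter()  # topic id -> views since last flush

    def record_view(self, topic_id: str) -> None:
        self.pending[topic_id] += 1

    def flush(self) -> str:
        """Return the NDJSON body for POST /_bulk and clear the buffer."""
        lines = []
        for topic_id, views in sorted(self.pending.items()):
            # Action line, then the scripted update for that document.
            lines.append(json.dumps(
                {"update": {"_index": self.index, "_id": topic_id}}))
            lines.append(json.dumps({
                "script": {
                    "source": "ctx._source.viewCount += params.by",
                    "params": {"by": views},
                }
            }))
        self.pending.clear()
        # The bulk body must end with a newline.
        return ("\n".join(lines) + "\n") if lines else ""


if __name__ == "__main__":
    buf = ViewCountBuffer()
    for _ in range(3):
        buf.record_view("7")
    buf.record_view("9")
    print(buf.flush())
```

The trade-off is that counts held in memory are lost if the process dies before a flush, which is usually acceptable for view counters.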

Upvotes: 12

ManishKG

Reputation: 599

I guess the best approach is to rethink your index design. It might make sense to create another index, with fewer fields and hence a lower index/update cost, that maps ids to their respective view counts. Your client side can then issue two queries to get all the required information.
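A minimal sketch of the client-side merge, assuming a hypothetical second index `topic_counters` whose documents share the topic's id and hold only `viewCount`. The first query is a normal search on the topics index; the second is a multi-get by id against the counter index. The function names are illustrative, and the inputs are the `hits.hits` list of a search response and the `docs` list of an `_mget` response.

```python
def build_mget_body(topic_hits):
    """Body for POST /topic_counters/_mget, built from the search hits."""
    return {"ids": [hit["_id"] for hit in topic_hits]}


def merge_view_counts(topic_hits, counter_docs):
    """Attach viewCount from the counter index to each topic hit."""
    counts = {
        doc["_id"]: doc["_source"]["viewCount"]
        for doc in counter_docs
        if doc.get("found")  # missing counters come back with found=false
    }
    merged = []
    for hit in topic_hits:
        topic = dict(hit["_source"], id=hit["_id"])
        topic["viewCount"] = counts.get(hit["_id"], 0)
        merged.append(topic)
    return merged


if __name__ == "__main__":
    hits = [
        {"_id": "1", "_source": {"title": "First topic"}},
        {"_id": "2", "_source": {"title": "Second topic"}},
    ]
    counters = [
        {"_id": "1", "found": True, "_source": {"viewCount": 120}},
        {"_id": "2", "found": False},
    ]
    print(build_mget_body(hits))
    print(merge_view_counts(hits, counters))
```

Since the counter documents are tiny, updating them is much cheaper than reindexing a full topic, at the cost of the extra lookup per page of results.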

Upvotes: 0
