battery

Reputation: 31

Automate indexing data into elasticsearch

I have a model whose flow is as follows

CMS --> Postgres --> Elasticsearch --> Querying from Elasticsearch --> Final Result

Every day, new entries are added in the CMS and the data is subsequently pushed into Elasticsearch. However, this takes up a lot of time, given the huge amount of data.

Is there a way to push each new CMS entry to Elasticsearch as soon as it is added, without manually prompting the application to do so?

I want to automate the process of pushing data into Elasticsearch from CMS. Any input would be welcome.

Also, I'm using elasticsearch-py as the client library.
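For the daily load itself, elasticsearch-py's bulk helper is much faster than indexing one document per HTTP request. A minimal sketch, assuming each CMS row is a dict with an `id` primary key (the index name `cms` and host URL are placeholders):

```python
def entries_to_actions(entries, index="cms"):
    """Turn CMS rows (dicts with an 'id' key) into bulk-index actions
    in the shape elasticsearch.helpers.bulk() expects."""
    for entry in entries:
        yield {
            "_index": index,
            "_id": entry["id"],
            "_source": {k: v for k, v in entry.items() if k != "id"},
        }

def bulk_push(entries, hosts=("http://localhost:9200",), index="cms"):
    # Third-party imports kept local so the pure helper above has no deps.
    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch(list(hosts))
    # helpers.bulk batches the actions into _bulk requests, avoiding one
    # HTTP round trip per document.
    return helpers.bulk(es, entries_to_actions(entries, index))
```

This doesn't make the push automatic, but it can shrink the time the existing batch job takes.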

Upvotes: 2

Views: 2240

Answers (2)

Steve Smith

Reputation: 388

To expand on Jayson's last point, if you really need 'realtime' updates from Postgres to Elasticsearch, you're going to need to use triggers and some system of pushing data from the DB to ES.

One method I have experimented with requires writing a dedicated Postgres->ES gateway and using some Postgres-specific features. I've written about it here: http://haltcondition.net/2014/04/realtime-postgres-elasticsearch/

The principle is actually pretty simple; the complexity of the method I have come up with is due to handling corner cases such as multiple gateways running and gateways becoming unavailable for a while. In short, my solution is:

  • Attach a trigger to all tables of interest that copies the updated row IDs to a temporary table.
  • The trigger also emits an async notification that a row has been updated.
  • A separate gateway (mine is written in Clojure) attaches to the Postgres server and listens for notifications. This is the tricky part, as not all Postgres client drivers support async notifications (there is a new experimental JDBC driver that does, which is what I use; psycopg2 appears to support this as well).
  • On update the gateway reads, transforms and pushes the data to Elasticsearch.

In my experiments this model is capable of sub-second updates to Elasticsearch after a Postgres row insert/update. Obviously this will vary in the real world though.
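The steps above can be sketched in Python instead of Clojure, since the asker is already on elasticsearch-py. This is a minimal gateway loop assuming a table `cms_entries` with an integer primary key `id` and a notification channel `es_sync` (all names hypothetical), using psycopg2's async-notification support:

```python
import select

# Hypothetical trigger DDL: notify the channel with the changed row's id.
TRIGGER_SQL = """
CREATE OR REPLACE FUNCTION notify_es_sync() RETURNS trigger AS $$
BEGIN
    PERFORM pg_notify('es_sync', NEW.id::text);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER cms_entries_es_sync
AFTER INSERT OR UPDATE ON cms_entries
FOR EACH ROW EXECUTE PROCEDURE notify_es_sync();
"""

def row_to_doc(row):
    """Transform a Postgres row (as a dict) into (es_id, document body):
    the primary key becomes the ES _id and is dropped from the body."""
    body = {k: v for k, v in row.items() if k != "id"}
    return row["id"], body

def listen_and_push(dsn, channel="es_sync", table="cms_entries", index="cms"):
    # Third-party imports kept local to the gateway loop.
    import psycopg2
    import psycopg2.extras
    from elasticsearch import Elasticsearch

    conn = psycopg2.connect(dsn)
    conn.set_isolation_level(psycopg2.extensions.ISOLATION_LEVEL_AUTOCOMMIT)
    cur = conn.cursor(cursor_factory=psycopg2.extras.RealDictCursor)
    cur.execute("LISTEN %s;" % channel)
    es = Elasticsearch()

    while True:
        # Block until Postgres delivers an async notification (60 s timeout).
        if select.select([conn], [], [], 60) == ([], [], []):
            continue
        conn.poll()
        while conn.notifies:
            note = conn.notifies.pop(0)
            cur.execute("SELECT * FROM %s WHERE id = %%s" % table,
                        (int(note.payload),))
            row = cur.fetchone()
            if row is not None:  # row may have been deleted since the notify
                doc_id, body = row_to_doc(row)
                es.index(index=index, id=doc_id, body=body)
```

Note this simple sketch skips the corner cases the post describes (multiple gateways, gateway downtime); the temporary-table step in the bullet list exists precisely so missed notifications can be replayed.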

There is a proof-of-concept project with Vagrant and Docker test frameworks here: https://bitbucket.org/tarkasteve/postgres-elasticsearch-realtime

Upvotes: 2

Jayson Minard

Reputation: 85936

The question is missing many details, but given you are going from Postgres into ElasticSearch you can use the JDBC "river" to either push or pull data, along with incremental changes, into ES. Rivers are plugins to ElasticSearch that manage data ingestion. The ElasticSearch JDBC river can be found at: https://github.com/jprante/elasticsearch-river-jdbc and it is mentioned in this little tutorial: http://voormedia.com/blog/2014/06/four-ways-to-index-relational-data-in-elasticsearch

You can index a full load, and also monitor for updates on a timer by either timestamp or sequence number. This basically sends batches of changes to ElasticSearch at a given interval.
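A river is registered by writing a `_meta` document into the special `_river` index. This is a sketch only; the JDBC URL, credentials, SQL, and cron schedule below are placeholders, and the exact parameter set is defined by the river's README, so check it before relying on this:

```python
def jdbc_river_meta(jdbc_url, user, password, sql, schedule=None):
    """Build the _meta document for the JDBC river plugin.
    All connection details passed in here are placeholders."""
    jdbc = {"url": jdbc_url, "user": user, "password": password, "sql": sql}
    if schedule is not None:
        jdbc["schedule"] = schedule  # cron expression for incremental polls
    return {"type": "jdbc", "jdbc": jdbc}

def register_river(es, name, meta):
    # Write the _meta doc into the _river index via an elasticsearch-py
    # client (rivers predate the removal of mapping types, hence doc_type).
    return es.index(index="_river", doc_type=name, id="_meta", body=meta)
```

The river then runs inside the ElasticSearch cluster, polling Postgres on the given schedule. (Rivers were later deprecated in ElasticSearch, so treat this as a point-in-time option.)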

If you want to push changes as they happen, this is something you must code either in your CMS or as triggers within Postgres, if those are able to issue HTTP calls out to the REST API of ElasticSearch (I don't know Postgres well enough to say whether they can). But pushing data really isn't an ElasticSearch issue; it is something you must solve in your CMS or the application that manages the data.
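If you control the CMS code path, the simplest push is to index into ElasticSearch in the same place the application writes to Postgres. A sketch using a psycopg2-style cursor and an elasticsearch-py client (the table and index names are placeholders, and the ES write is best-effort here, not transactional with the DB insert):

```python
def save_entry(pg_cur, es, entry, table="entries", index="cms"):
    """Insert a CMS entry (a dict of column -> value, no 'id') into
    Postgres, then immediately index it in ElasticSearch under the
    newly assigned primary key. pg_cur is a psycopg2 cursor, es an
    elasticsearch-py client."""
    cols = sorted(entry)
    sql = "INSERT INTO {} ({}) VALUES ({}) RETURNING id".format(
        table, ", ".join(cols), ", ".join(["%s"] * len(cols)))
    pg_cur.execute(sql, [entry[c] for c in cols])
    new_id = pg_cur.fetchone()[0]
    # Push the same document to ES so the index never lags the DB write.
    es.index(index=index, id=new_id, body=entry)
    return new_id
```

If the ES call fails you would need retry or reconciliation logic, which is exactly why the trigger-plus-gateway approach in the other answer keeps a record of pending row IDs.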

Upvotes: 2
