AJ222

Reputation: 1134

Mongodb to redshift

We have a few collections in mongodb that we wish to transfer to redshift (on an automatic incremental daily basis). How can we do it? Should we export the mongo to csv?

Upvotes: 12

Views: 9610

Answers (6)

Adeel Nazir

Reputation: 49

Stitch Data is the best tool I've ever seen for replicating incrementally from MongoDB to Redshift, within a few clicks and minutes.

It automatically and dynamically detects DML and DDL changes on the tables it replicates.

Upvotes: 0

faisal_kk

Reputation: 1073

AWS Database Migration Service (DMS) has added support for MongoDB as a source and Amazon DynamoDB as a target, so I think DMS is now the best option for migrating from MongoDB to Redshift.

  • MongoDB versions 2.6.x and 3.x as a database source
  • Document Mode and Table Mode supported
  • Supports change data capture (CDC)

Details - http://docs.aws.amazon.com/dms/latest/userguide/CHAP_Source.MongoDB.html
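
If you'd rather script the setup than click through the console, a minimal boto3 sketch (not from the answer; all names, hosts and credentials below are placeholders) for creating the MongoDB source endpoint might look like this. The replication instance, the Redshift target endpoint and the replication task still have to be created separately:

    import boto3  # AWS SDK for Python

    dms = boto3.client("dms", region_name="us-east-1")

    # Hypothetical MongoDB source endpoint; "table mode" is selected via NestingLevel="one".
    response = dms.create_endpoint(
        EndpointIdentifier="mongo-source",          # placeholder name
        EndpointType="source",
        EngineName="mongodb",
        MongoDbSettings={
            "ServerName": "mongo.example.com",      # placeholder host
            "Port": 27017,
            "DatabaseName": "mydb",                 # placeholder database
            "AuthType": "password",
            "AuthSource": "admin",
            "Username": "dms_user",                 # placeholder credentials
            "Password": "dms_password",
            "NestingLevel": "one",                  # "one" = table mode, "none" = document mode
        },
    )
    print(response["Endpoint"]["EndpointArn"])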

Upvotes: 2

Sanatani

Reputation: 61

Honestly, I'd recommend using a third party here. I've used Panoply (panoply.io) and would recommend it. It'll take your mongo collections and flatten them into their own tables in redshift.

Upvotes: 5

Mark Butler

Reputation: 4391

I wrote some code to export data from Mixpanel into Redshift for a client. Initially the client was exporting to Mongo, but we found Redshift offered very large performance improvements for querying. So first of all we transferred the data out of Mongo into Redshift, and then we came up with a direct solution that transfers the data from Mixpanel to Redshift.

To store JSON data in Redshift, you first need to create the SQL DDL for the schema, i.e. a CREATE TABLE script.

You can use a tool like Variety to help as it can give you some insight into your Mongo schema. However it does struggle with big datasets - you might need to subsample your dataset.

Alternatively DDLgenerator can generate DDL from various sources including CSV or JSON. This also struggles with large datasets (well the dataset I was dealing with was 120GB).

So in theory you could use mongoexport to generate CSV or JSON from Mongo and then run it through ddlgenerator to get the DDL.

In practice I found using JSON export a little easier because you don't need to specify the fields you want to extract. You need to select the JSON array format. Specifically:

   mongoexport --db <your db> --collection <your_collection> --jsonArray > data.json
   head data.json > sample.json
   ddlgenerator postgresql sample.json       

Here - because I am using head - I use a sample of the data to show the process works. However, if your database has schema variation, you want to compute the schema based on the whole database which could take several hours.
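
As an aside, a rough sketch of scanning the whole export to see which top-level fields and types occur might look like the following; it assumes the export was made without --jsonArray, i.e. one JSON document per line, so the file can be streamed:

    import json
    from collections import defaultdict

    # Collect the set of Python types seen for every top-level field
    # across the whole export (one JSON document per line).
    field_types = defaultdict(set)

    with open("data.json") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            doc = json.loads(line)
            for key, value in doc.items():
                field_types[key].add(type(value).__name__)

    for key, types in sorted(field_types.items()):
        print(key, sorted(types))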

Next you upload the data into Redshift.

If you have exported JSON, you need to use Redshift's COPY from JSON feature, which requires you to define a JSONPaths file.
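
As a hedged sketch (the table, bucket, IAM role and column names below are made up), the JSONPaths file and the COPY could look something like this, issued here through psycopg2:

    import json
    import psycopg2  # PostgreSQL driver, also works against Redshift

    # Hypothetical JSONPaths file: one path per target column, in column order.
    jsonpaths = {"jsonpaths": ["$['_id']['$oid']", "$['name']", "$['created_at']"]}
    with open("events_jsonpaths.json", "w") as f:
        json.dump(jsonpaths, f)
    # ... upload data.json and events_jsonpaths.json to S3 before running the COPY ...

    conn = psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
                            port=5439, dbname="analytics",
                            user="admin", password="...")  # placeholder connection details
    with conn, conn.cursor() as cur:
        cur.execute("""
            COPY events
            FROM 's3://my-bucket/data.json'
            IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopy'
            FORMAT AS JSON 's3://my-bucket/events_jsonpaths.json';
        """)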

For more information check out the Snowplow blog - they use JSONPaths to map the JSON onto a relational schema. See their blog post about why people might want to load JSON into Redshift.

Turning the JSON into columns allows much faster queries than other approaches such as using JSON_EXTRACT_PATH_TEXT.

For incremental backups, it depends on whether data is being added or existing data is changing. For analytics, it's normally the former. The approach I used was to export the analytic data once a day, then copy it into Redshift in an incremental fashion.
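
A sketch of that daily incremental export, assuming the documents carry a created_at timestamp (a made-up field name) and using mongoexport's --query option to select a single day; note that the exact extended-JSON form of the $date literal depends on your mongoexport version:

    import subprocess
    from datetime import date, timedelta

    # Export only yesterday's documents by filtering on a (hypothetical) created_at field.
    day = date.today() - timedelta(days=1)
    start = day.isoformat() + "T00:00:00Z"
    end = day.isoformat() + "T23:59:59Z"
    query = '{"created_at": {"$gte": {"$date": "%s"}, "$lte": {"$date": "%s"}}}' % (start, end)

    subprocess.run([
        "mongoexport",
        "--db", "mydb",                      # placeholder database
        "--collection", "events",            # placeholder collection
        "--query", query,
        "--out", "events_%s.json" % day.isoformat(),
    ], check=True)
    # The resulting file is then uploaded to S3 and COPYed into Redshift,
    # appending to (not replacing) the existing table.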

Here are some related resources although in the end I did not use them:

Upvotes: 19

Meny Issakov

Reputation: 1421

I had to tackle the same issue (not on a daily basis though). As aks mentioned, you can use mongoexport to export the data, but keep in mind that Redshift doesn't support arrays, so if your collections' data contains arrays you'll find it a bit problematic.

My solution was to pipe the mongoexport output into a small utility program I wrote that transforms the mongoexport JSON rows into my desired CSV output. Piping the output also allows you to make the process parallel.
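
A minimal sketch of such a transform (not the original utility; the field names are made up, and arrays are simply serialized into a single string column since Redshift has no array type):

    import csv
    import json
    import sys

    # Read mongoexport output (one JSON document per line) from stdin,
    # write flattened CSV rows to stdout.
    FIELDS = ["_id", "name", "tags", "created_at"]   # hypothetical column order

    writer = csv.writer(sys.stdout)
    writer.writerow(FIELDS)

    for line in sys.stdin:
        line = line.strip()
        if not line:
            continue
        doc = json.loads(line)
        row = []
        for field in FIELDS:
            value = doc.get(field, "")
            if isinstance(value, dict) and "$oid" in value:
                value = value["$oid"]              # unwrap mongoexport's ObjectId form
            elif isinstance(value, (list, dict)):
                value = json.dumps(value)          # arrays/objects become a JSON string column
            row.append(value)
        writer.writerow(row)

It would be invoked as something like mongoexport --db mydb --collection events | python transform.py > events.csv (names hypothetical).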

mongoexport allows you to add a MongoDB query to the command, so if your collection's data supports it you can spawn N different mongoexport processes, pipe their results into the transform program, and decrease the total runtime of the migration process.
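
A sketch of that parallel variant, splitting the collection on a made-up numeric user_id field and piping each mongoexport through the (hypothetical) transform.py from the sketch above:

    import subprocess

    # Split the collection into disjoint ranges of a (hypothetical) numeric user_id field
    # and export each range with its own mongoexport process.
    ranges = [(0, 25000), (25000, 50000), (50000, 75000), (75000, 100000)]

    procs = []
    for i, (lo, hi) in enumerate(ranges):
        query = '{"user_id": {"$gte": %d, "$lt": %d}}' % (lo, hi)
        out = open("events_part_%d.csv" % i, "w")
        mongo = subprocess.Popen(
            ["mongoexport", "--db", "mydb", "--collection", "events", "--query", query],
            stdout=subprocess.PIPE)
        # Pipe each export through the transform utility.
        transform = subprocess.Popen(["python", "transform.py"],
                                     stdin=mongo.stdout, stdout=out)
        procs.append((mongo, transform, out))

    for mongo, transform, out in procs:
        mongo.wait()
        transform.wait()
        out.close()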

Later on, I uploaded the files to S3, and performed a COPY into the relevant table.
This should be a pretty easy solution.
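
For completeness, a hedged sketch of that last step (bucket, table and cluster details are placeholders):

    import boto3
    import psycopg2

    # Upload the CSV produced by the transform step to S3...
    s3 = boto3.client("s3")
    s3.upload_file("events.csv", "my-bucket", "exports/events.csv")   # placeholder bucket/key

    # ...and COPY it into the target Redshift table.
    conn = psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
                            port=5439, dbname="analytics",
                            user="admin", password="...")             # placeholder connection details
    with conn, conn.cursor() as cur:
        cur.execute("""
            COPY events
            FROM 's3://my-bucket/exports/events.csv'
            IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftCopy'
            CSV IGNOREHEADER 1;
        """)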

Upvotes: 0

aks

Reputation: 720

A few questions whose answers would be helpful to know:

  1. Is this an add-only, ever-growing incremental sync, i.e. data is only being added and not updated/removed, or is your Redshift instance interested only in additions anyway?
  2. Is it acceptable to have data inconsistency because deletes/updates happen at the source but are not fed to the Redshift instance?
  3. Does it need to be a daily incremental batch, or can it be real-time as changes happen?

Depending on your situation, maybe mongoexport works for you, but you have to understand its shortcomings, which can be found at http://docs.mongodb.org/manual/reference/program/mongoexport/.

Upvotes: 0
