deanj

Reputation: 93

Processing data stored in Redshift

We're currently using Redshift as our data warehouse, which we're very happy with. However, we now have a requirement to do machine learning against the data in our warehouse. Given the volume of data involved, ideally I'd want to run the computation in the same location as the data rather than shifting the data around, but this doesn't seem possible with Redshift. I've looked at MADlib, but this is not an option as Redshift does not support UDFs (which MADlib requires). I'm currently looking at shifting the data over to EMR and processing it with the Apache Spark machine learning library (or maybe H2O, or Mahout, or whatever). So my questions are:

  1. is there a better way?
  2. if not, how should I make the data accessible to Spark? The options I've identified so far include: use Sqoop to load it into HDFS, use DBInputFormat, do a Redshift export to S3 and have Spark grab it from there. What are the pros/cons for these different approaches (and any others) when using Spark?

Note that this is offline batch learning, but we'd like to be able to do this as quickly as possible so that we can iterate on experiments quickly.
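For reference, the "Redshift export to S3" option mentioned above is done with Redshift's UNLOAD statement. A minimal sketch, where the table name, bucket, and credentials are placeholders:

```sql
-- Export a table to S3 in parallel (one file per slice by default).
-- Replace the query, bucket path, and credentials with your own.
UNLOAD ('SELECT * FROM my_table')
TO 's3://my-bucket/exports/my_table_'
CREDENTIALS 'aws_access_key_id=<key>;aws_secret_access_key=<secret>'
DELIMITER '|' GZIP;
```

Because UNLOAD writes one file per slice in parallel, Spark can then read the whole prefix (`s3://my-bucket/exports/my_table_*`) as a single dataset.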

Upvotes: 3

Views: 926

Answers (2)

Josh Rosen

Reputation: 13841

If you'd like to query Redshift data in Spark and you're using Spark 1.4.0 or newer, check out spark-redshift, a library which supports loading data from Redshift into Spark SQL DataFrames and saving DataFrames back to Redshift. If you're querying large volumes of data, this approach should perform better than JDBC because it will be able to unload and query the data in parallel. If you plan to run many different ML jobs on your Redshift data, then consider using spark-redshift to export it out of Redshift and save it to S3 in an efficient file format, such as Parquet.
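A minimal sketch of that workflow from PySpark (Spark 1.4+), assuming the JDBC URL, table name, and S3 paths below are placeholders for your own:

```python
# Sketch: load a Redshift table into a Spark SQL DataFrame via spark-redshift,
# then persist it once as Parquet so repeated ML jobs skip the Redshift unload.
# All connection details and bucket names here are placeholders.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="redshift-ml")
sqlContext = SQLContext(sc)

df = (sqlContext.read
      .format("com.databricks.spark.redshift")
      .option("url", "jdbc:redshift://host:5439/mydb?user=u&password=p")
      .option("dbtable", "my_table")
      .option("tempdir", "s3n://my-bucket/tmp")  # staging area for parallel UNLOAD
      .load())

# Save in an efficient columnar format for iterative experiments.
df.write.parquet("s3n://my-bucket/ml-data/my_table.parquet")
```

Subsequent jobs can then read the Parquet copy directly with `sqlContext.read.parquet(...)` instead of touching Redshift at all.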

Disclosure: I'm one of the authors of spark-redshift.

Upvotes: 2

Yuri Levinsky

Reputation: 1595

You can run Spark alongside your existing Hadoop cluster by launching it as a separate service on the same machines. To access Hadoop data from Spark, just use an hdfs:// URL (typically hdfs://<namenode>:9000/path, but you can find the right URL on your Hadoop NameNode's web UI). Alternatively, you can set up a separate cluster for Spark and still have it access HDFS over the network; this will be slower than disk-local access, but may not be a concern if you are still running in the same local area network (e.g. you place a few Spark machines on each rack that you have Hadoop on). You can use the AWS Data Pipeline service, or Redshift's UNLOAD command (which exports to S3), to move data out of Redshift and into HDFS. You can also do some machine learning against Redshift itself, depending on the tool you're using or the algorithm you're implementing. Keep in mind, though, that Redshift is less a database and more a data store, with all the pros and cons that implies.
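Once the data is in HDFS, pointing Spark at it is a one-liner. A minimal sketch, where the namenode host/port and export path are placeholders for your cluster:

```python
# Sketch: read data already exported from Redshift into HDFS.
# The namenode address and path below are placeholders.
from pyspark import SparkContext

sc = SparkContext(appName="hdfs-read")

# Reads every file under the directory as one RDD of lines.
lines = sc.textFile("hdfs://namenode:9000/exports/redshift/my_table/")
print(lines.count())
```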

Upvotes: 0
