sanojmg

Reputation: 48

Importing blob data from RDBMS (Sybase) to Cassandra

I am trying to import large blob data (around 10 TB) from an RDBMS (Sybase ASE) into Cassandra, using DataStax Enterprise (DSE) 5.0.

Is Sqoop still the recommended way to do this in DSE 5.0? As per the release notes (http://docs.datastax.com/en/latest-dse/datastax_enterprise/RNdse.html):

Hadoop and Sqoop are deprecated. Use Spark instead. (DSP-7848)

So should I use Spark SQL with JDBC data source to load data from Sybase, and then save the data frame to a Cassandra table?

Is there a better way to do this? Any help/suggestions will be appreciated.
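For reference, this is roughly the approach I have in mind (a minimal sketch only, run from the dse spark shell; the Sybase URL, driver class, keyspace and table names are placeholders):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)  // sc: the SparkContext provided by the dse spark shell

    // Read the source table from Sybase over JDBC (the jConnect driver must be on the classpath)
    val sybaseDF = sqlContext.read
      .format("jdbc")
      .option("url", "jdbc:sybase:Tds:sybase-host:5000/mydb")   // placeholder URL
      .option("driver", "com.sybase.jdbc4.jdbc.SybDriver")      // placeholder driver class
      .option("dbtable", "blob_table")
      .option("user", "user")
      .option("password", "password")
      .load()

    // Write the DataFrame to Cassandra using the spark-cassandra-connector data source
    sybaseDF.write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "my_ks", "table" -> "blob_table"))
      .save()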

Edit: As per the DSE documentation (http://docs.datastax.com/en/latest-dse/datastax_enterprise/spark/sparkIntro.html), writing to blob columns from Spark is not supported:

The following Spark features and APIs are not supported:

Writing to blob columns from Spark

Reading columns of all types is supported; however, you must convert collections of blobs to byte arrays before serialising.

Upvotes: 0

Views: 233

Answers (1)

Brad Schoening

Reputation: 1381

Spark is preferred for ETL of large data sets because it performs a distributed ingest. Oracle data can be loaded into Spark RDDs or DataFrames and then written with saveToCassandra(keyspace, tablename). Cassandra Summit 2016 had a presentation, "Using Spark to Load Oracle Data into Cassandra" by Jim Hatcher, which discusses this topic in depth and provides examples.
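A rough sketch of that approach (a minimal example only, assuming the spark-cassandra-connector implicits are available in the dse spark shell; the JDBC URL, table and column names are hypothetical, and a Sybase URL works the same way as the Oracle one shown):

    import com.datastax.spark.connector._
    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)

    // Load the source table over JDBC (Oracle here, as in the talk)
    val srcDF = sqlContext.read
      .format("jdbc")
      .option("url", "jdbc:oracle:thin:@//db-host:1521/ORCL")  // placeholder URL
      .option("dbtable", "SRC_TABLE")
      .option("user", "user")
      .option("password", "password")
      .load()

    // Map rows to a case class whose fields match the Cassandra columns,
    // then write with saveToCassandra(keyspace, table)
    case class DestRow(id: Long, payload: Array[Byte])

    srcDF.rdd
      .map(r => DestRow(r.getLong(0), r.getAs[Array[Byte]](1)))
      .saveToCassandra("my_ks", "dest_table", SomeColumns("id", "payload"))

For a load of this size, the JDBC read can be parallelized with the partitionColumn, lowerBound, upperBound and numPartitions options so the extract does not run through a single connection.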

Sqoop is deprecated but should still work in DSE 5.0. If it's a one-time load and you're already comfortable with Sqoop, try that.

Upvotes: 0
