Zach B

Reputation: 97

Scala Spark Cassandra update or insert rows on primary key match

I am migrating data from CSV exports of SQL tables (one file per table) to a Cassandra database that uses a pre-determined, standardized format. As a result, I am doing transformations, joins, etc. on the SQL data to match that format before writing it to Cassandra. My issue is that this migration happens in batches (not all at once), so I cannot guarantee that the data from both sides of a table join will be present when a row is written to Cassandra.

For example: Table 1 and Table 2 both contain the partitioning and clustering key columns (which allows the join, since their combination is unique) and are combined with a full outer join. However, with the way we receive data, a given "batch" might contain a record from Table 1 but not its matching record from Table 2. The full outer join itself is no problem: the columns from the missing side are simply filled with nulls. On the next interval, I then receive the Table 2 portion that should have been joined to the Table 1 record from the earlier batch.
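To illustrate, here is a simplified sketch of the join; the paths, table names, and key column names are placeholders, not my real schema:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("sql-to-cassandra-migration").getOrCreate()

// Transformed data for the current batch (placeholder paths).
val table1Df = spark.read.option("header", "true").csv("/path/to/table1.csv")
val table2Df = spark.read.option("header", "true").csv("/path/to/table2.csv")

// Full outer join on the partitioning + clustering key columns.
// A row that exists in only one table comes through with nulls in the
// other table's columns.
val joinedDf = table1Df.join(
  table2Df,
  Seq("partition_key", "clustering_key"),
  "full_outer"
)
```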

How do I get those entries combined?

I have looked for an update-or-insert ("upsert") style method in Spark that acts based on whether that combination of partitioning and clustering keys already exists, but have not turned up anything. Is there an efficient way to do this, or will I have to look up every entry with a spark.sql query and then update or write it accordingly?

Note: using UUIDs to avoid the primary key conflict will not solve the issue; I do not want two partial entries. All data for a particular primary key needs to end up in the same row.

Thanks for any help that you can provide!

Upvotes: 2

Views: 640

Answers (1)

Joe K

Reputation: 18424

I think you should be able to just write the data directly to Cassandra and not have to worry about it, assuming all the primary keys are the same.

Cassandra's inserts are really "insert or update" so I believe when you insert one side of a join, it will just leave some columns empty. Then when you insert the other side of the join, it will update that row with the new columns.
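Something along these lines, using the Spark Cassandra Connector's DataFrame writer, is what I have in mind (untested; the host, keyspace, and table names are placeholders):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder()
  .appName("cassandra-migration")
  // Assumes spark-cassandra-connector is on the classpath and Cassandra
  // is reachable at this (placeholder) address.
  .config("spark.cassandra.connection.host", "127.0.0.1")
  .getOrCreate()

// Write each batch straight to Cassandra. Because Cassandra inserts are
// upserts keyed on the primary key, a later write with the same
// partitioning/clustering key values updates the existing row instead of
// creating a second one.
def writeToCassandra(df: DataFrame): Unit = {
  df.write
    .format("org.apache.spark.sql.cassandra")
    .options(Map(
      "keyspace" -> "my_keyspace", // placeholder
      "table"    -> "my_table"     // placeholder
    ))
    .mode("append")
    .save()
}

// Usage: writeToCassandra(batchDf), where batchDf is whatever DataFrame
// holds the current batch (or one side of the join).
```

One thing to watch: if the rows you write contain nulls (e.g. from the outer join), I believe those nulls get written too and can overwrite values already in Cassandra; the connector's spark.cassandra.output.ignoreNulls setting may help there, but check what your connector version supports.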

Take this with a grain of salt, as I don't have a Spark+Cassandra cluster available to test and make sure.

Upvotes: 2
