Ben Harrison

Reputation: 2219

How to efficiently insert bulk data into Cassandra using Python?

I have a Python application, built with Flask, that allows importing many data records (anywhere from 10k to 250k+ records at a time). Right now it inserts into a Cassandra database one record at a time, like this:

for transaction in transactions:
    self.transaction_table.insert_record(transaction)

This process is incredibly slow. Is there a best-practice approach I could use to more efficiently insert this bulk data?

Upvotes: 0

Views: 3298

Answers (2)

medvekoma

Reputation: 1179

The easiest solution is to generate CSV files from your data and import them with the COPY command. That should work well for up to a few million rows. For more complicated scenarios you could use the sstableloader tool.
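
A rough sketch of that approach (the keyspace, table, and column names below are placeholders, not taken from the question):

    import csv

    # Dump the records to a CSV file first.
    with open("transactions.csv", "w", newline="") as f:
        writer = csv.writer(f)
        for t in transactions:
            writer.writerow([t["id"], t["account_id"], t["amount"]])

    # Then load it from cqlsh, for example:
    #   cqlsh -e "COPY my_keyspace.transactions (id, account_id, amount) FROM 'transactions.csv'"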

Upvotes: 1

Samyel

Reputation: 139

You can use batch statements for this; an example and documentation are available in the DataStax documentation. You can also add some child workers and/or asynchronous queries on top of this.

In terms of best practices, it is more efficient if each batch contains only one partition key. You do not want one node acting as coordinator for many different partition keys; it is faster to contact each owning node directly.
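
A minimal sketch of single-partition batches with the Python driver (keyspace, table, and column names are placeholders; I'm assuming the partition key is something like account_id):

    from collections import defaultdict

    from cassandra.cluster import Cluster
    from cassandra.query import BatchStatement, BatchType

    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect("my_keyspace")
    insert = session.prepare(
        "INSERT INTO transactions (account_id, id, amount) VALUES (?, ?, ?)"
    )

    # Group records by partition key so each batch touches a single partition.
    by_partition = defaultdict(list)
    for t in transactions:
        by_partition[t["account_id"]].append(t)

    for account_id, rows in by_partition.items():
        # UNLOGGED is fine here since every statement targets the same partition.
        batch = BatchStatement(batch_type=BatchType.UNLOGGED)
        for t in rows:
            batch.add(insert, (t["account_id"], t["id"], t["amount"]))
        session.execute(batch)

You may still want to cap the number of statements per batch so large partitions don't push you over the server's batch size warning threshold.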

If each record has a different partition key, a single prepared statement with some child workers or asynchronous queries may work out better.
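
For that case, the driver's concurrent execution helper is a reasonable sketch (again, the schema details are placeholders):

    from cassandra.cluster import Cluster
    from cassandra.concurrent import execute_concurrent_with_args

    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect("my_keyspace")
    insert = session.prepare(
        "INSERT INTO transactions (account_id, id, amount) VALUES (?, ?, ?)"
    )

    params = [(t["account_id"], t["id"], t["amount"]) for t in transactions]

    # Keeps up to `concurrency` asynchronous inserts in flight at a time.
    results = execute_concurrent_with_args(session, insert, params, concurrency=50)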

You may also want to consider using a TokenAware load balancing policy, which lets the driver contact a replica that owns the data directly instead of routing every request through another node as coordinator.
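
Configuring that looks roughly like this (a sketch using the legacy Cluster constructor arguments; newer driver versions expose the same thing through execution profiles):

    from cassandra.cluster import Cluster
    from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy

    # Wrap the default DC-aware policy so that statements are routed
    # directly to a replica that owns the partition key.
    cluster = Cluster(
        ["127.0.0.1"],
        load_balancing_policy=TokenAwarePolicy(DCAwareRoundRobinPolicy()),
    )
    session = cluster.connect("my_keyspace")

Note that token-aware routing only helps when the driver can compute the routing key, which it can for prepared statements.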

Upvotes: 1
