Ben Harrison

Reputation: 2219

How to efficiently insert bulk data into Cassandra using Python?

I have a Python application, built with Flask, that allows importing many data records (anywhere from 10k to 250k+ records at a time). Right now it inserts into a Cassandra database one record at a time, like this:

for transaction in transactions:
    self.transaction_table.insert_record(transaction)

This process is incredibly slow. Is there a best-practice approach I could use to more efficiently insert this bulk data?

Upvotes: 0

Views: 3298

Answers (2)

medvekoma

Reputation: 1179

The easiest solution is to generate CSV files from your data and import them with the COPY command. That should work well for up to a few million rows. For more complicated scenarios you could use the sstableloader tool.
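
A rough sketch of that approach (the keyspace, table, and column names below are placeholders, not taken from the question):

    import csv

    # Dump the records to a CSV file first.
    with open("transactions.csv", "w", newline="") as f:
        writer = csv.writer(f)
        for t in transactions:
            writer.writerow([t["id"], t["account_id"], t["amount"]])

    # Then load it from cqlsh, for example:
    #   cqlsh -e "COPY my_keyspace.transactions (id, account_id, amount) FROM 'transactions.csv'"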

Upvotes: 1

Samyel

Reputation: 139

You can use batch statements for this; an example and documentation are available in the DataStax documentation. You can also add some child workers and/or asynchronous queries on top of this.

In terms of best practices, it is more efficient if each batch contains only one partition key. You do not want one node acting as coordinator for many different partition keys; it is faster to contact each owning node directly.
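
A minimal sketch of single-partition batches with the Python driver (keyspace, table, and column names are placeholders; I'm assuming the partition key is something like account_id):

    from collections import defaultdict

    from cassandra.cluster import Cluster
    from cassandra.query import BatchStatement, BatchType

    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect("my_keyspace")
    insert = session.prepare(
        "INSERT INTO transactions (account_id, id, amount) VALUES (?, ?, ?)"
    )

    # Group records by partition key so each batch touches a single partition.
    by_partition = defaultdict(list)
    for t in transactions:
        by_partition[t["account_id"]].append(t)

    for account_id, rows in by_partition.items():
        # UNLOGGED is fine here since every statement targets the same partition.
        batch = BatchStatement(batch_type=BatchType.UNLOGGED)
        for t in rows:
            batch.add(insert, (t["account_id"], t["id"], t["amount"]))
        session.execute(batch)

You may still want to cap the number of statements per batch so large partitions don't push you over the server's batch size warning threshold.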

If each record has a different partition key, a single prepared statement with some child workers or asynchronous queries may work out better.
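
For that case, the driver's concurrent execution helper is a reasonable sketch (again, the schema details are placeholders):

    from cassandra.cluster import Cluster
    from cassandra.concurrent import execute_concurrent_with_args

    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect("my_keyspace")
    insert = session.prepare(
        "INSERT INTO transactions (account_id, id, amount) VALUES (?, ?, ?)"
    )

    params = [(t["account_id"], t["id"], t["amount"]) for t in transactions]

    # Keeps up to `concurrency` asynchronous inserts in flight at a time.
    results = execute_concurrent_with_args(session, insert, params, concurrency=50)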

You may also want to consider using a TokenAware load balancing policy, which lets the driver contact a replica that owns the data directly instead of routing every request through another node as coordinator.
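
Configuring that looks roughly like this (a sketch using the legacy Cluster constructor arguments; newer driver versions expose the same thing through execution profiles):

    from cassandra.cluster import Cluster
    from cassandra.policies import DCAwareRoundRobinPolicy, TokenAwarePolicy

    # Wrap the default DC-aware policy so that statements are routed
    # directly to a replica that owns the partition key.
    cluster = Cluster(
        ["127.0.0.1"],
        load_balancing_policy=TokenAwarePolicy(DCAwareRoundRobinPolicy()),
    )
    session = cluster.connect("my_keyspace")

Note that token-aware routing only helps when the driver can compute the routing key, which it can for prepared statements.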

Upvotes: 1
