Reputation: 2219
I have a Python application, built with Flask, that allows importing many data records (anywhere from 10k to 250k+ records at a time). Right now it inserts into a Cassandra database one record at a time, like this:
for transaction in transactions:
    self.transaction_table.insert_record(transaction)
This process is incredibly slow. Is there a best-practice approach I could use to more efficiently insert this bulk data?
Upvotes: 0
Views: 3298
Reputation: 1179
The easiest solution is to generate CSV files from your data and import them with the COPY command. That should work well for up to a few million rows. For more complicated scenarios you could use the sstableloader tool.
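A rough sketch of that approach; the keyspace ks, the table transactions, and the column names below are assumptions, not details from the question:

import csv

# Dump the records to a CSV file. The attribute names on each transaction
# are assumptions; adapt them to your model.
with open("transactions.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "account_id", "amount", "created_at"])  # header row
    for transaction in transactions:
        writer.writerow([transaction.id, transaction.account_id,
                         transaction.amount, transaction.created_at])

# Then load the file with cqlsh's COPY command, e.g.:
#   cqlsh -e "COPY ks.transactions (id, account_id, amount, created_at)
#             FROM 'transactions.csv' WITH HEADER = TRUE;"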
Upvotes: 1
Reputation: 139
You can use batch statements for this; an example is available in the DataStax documentation. You can also use some child workers and/or async queries on top of this.
In terms of best practices, it is more efficient if each batch contains only one partition key. This is because you do not want a node to act as coordinator for many different partition keys; it is faster to contact each individual node directly.
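A minimal sketch of per-partition batching with the DataStax Python driver, assuming each transaction is a dict and account_id is the partition key (both assumptions, as are the keyspace and column names):

from collections import defaultdict

from cassandra.cluster import Cluster
from cassandra.query import BatchStatement, BatchType

cluster = Cluster(["127.0.0.1"])    # contact point is an assumption
session = cluster.connect("ks")     # keyspace name is an assumption

insert = session.prepare(
    "INSERT INTO transactions (account_id, id, amount) VALUES (?, ?, ?)"
)

# Group records by partition key so each batch touches a single partition.
by_partition = defaultdict(list)
for t in transactions:              # transactions as in the question
    by_partition[t["account_id"]].append(t)

for account_id, records in by_partition.items():
    # Unlogged batches skip the batch log; reasonable here since every
    # statement in the batch targets the same partition.
    batch = BatchStatement(batch_type=BatchType.UNLOGGED)
    for t in records:
        batch.add(insert, (t["account_id"], t["id"], t["amount"]))
    session.execute(batch)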
If each record has a different partition key, a single prepared statement with some child workers may work out better.
You may also want to consider using a TokenAware load balancing policy, which allows the driver to contact the relevant replica node directly instead of routing the request through another node acting as coordinator.
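A minimal sketch combining a TokenAware policy with a prepared statement and async queries; the keyspace, table, and column names are assumptions:

from cassandra.cluster import Cluster
from cassandra.policies import TokenAwarePolicy, DCAwareRoundRobinPolicy

# TokenAwarePolicy routes each statement to a replica that owns the row's token.
cluster = Cluster(
    ["127.0.0.1"],   # contact point is an assumption
    load_balancing_policy=TokenAwarePolicy(DCAwareRoundRobinPolicy()),
)
session = cluster.connect("ks")    # keyspace name is an assumption

insert = session.prepare(
    "INSERT INTO transactions (account_id, id, amount) VALUES (?, ?, ?)"
)

# Issue a window of async inserts, then wait for them before sending more,
# so the driver's request queue is not overwhelmed.
futures = []
for t in transactions:             # transactions as in the question
    futures.append(
        session.execute_async(insert, (t["account_id"], t["id"], t["amount"]))
    )
    if len(futures) >= 100:
        for f in futures:
            f.result()             # raises if the insert failed
        futures = []

for f in futures:
    f.result()

The window size of 100 is an arbitrary example; tune it to what your cluster and client can sustain.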
Upvotes: 1