Reputation: 11
I am new to Cassandra, so I may be missing something. My goal is to insert 500,000 rows as quickly as possible, using Java (DataStax driver). It currently inserts only about 400 records per second, so the full 500,000 inserts take many minutes to complete. Duplicates in the ArrayList are possible, so the insert should behave as an insert/update (in other words, the Java list might contain duplicates, but the database table should contain only distinct values).
A select query returns the 500k records from Cassandra in less than 1 second, but the insert into Cassandra takes a really long time. I am hoping the insert of 500k records could take less than 10 seconds. What can I do to make the inserts much faster?
Here is a definition for the Cassandra table:
create table mykeyspace.mytablename
(
    my_id_record text primary key
);
Here is the Java insert (only the relevant code is shown; error handling is removed for simplicity):
String insertCQL = "INSERT INTO mykeyspace.mytablename(my_id_record) VALUES (?);";
PreparedStatement insertPrepStmnt = session.prepare(insertCQL);
for (String myId : myArrayList) {
    // Synchronous call: blocks until Cassandra acknowledges this row
    session.execute(insertPrepStmnt.bind(myId));
}
As you can see, it's inserting 500,000 records of a string value into a table with a single field (the primary key field).
Is 400 inserts per second the expected speed for Cassandra?
Any suggestions for what I can do to speed it up would be greatly appreciated.
Upvotes: 1
Views: 875
Reputation: 87329
You are using the synchronous API, which means each insert waits for the server's response before the next record is sent, so throughput is bounded by round-trip latency. You can get much better throughput with the asynchronous API, but you need to limit how many requests per connection are in flight at the same time. You may also need to tune connection pooling for that.
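To illustrate the idea of async writes with a cap on in-flight requests, here is a minimal sketch, not a tuned implementation. It assumes a driver whose executeAsync returns a CompletionStage (as in the DataStax Java driver 4.x; the 3.x driver returns a different future type and would need adapting). The names ThrottledInserts, insertAll and maxInFlight are hypothetical helpers made up for this example, and the driver call is passed in as a function so the throttling logic stays independent of the driver.

```java
import java.util.List;
import java.util.concurrent.CompletionStage;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical helper: issues async writes while capping how many are in flight. */
final class ThrottledInserts {

    /**
     * Calls asyncInsert once per id, never allowing more than maxInFlight
     * unfinished requests at a time. Returns the number of failed requests.
     */
    static int insertAll(List<String> ids,
                         Function<String, CompletionStage<?>> asyncInsert,
                         int maxInFlight) {
        Semaphore permits = new Semaphore(maxInFlight);
        AtomicInteger failures = new AtomicInteger();

        for (String id : ids) {
            // Blocks here once maxInFlight requests are still pending.
            permits.acquireUninterruptibly();
            CompletionStage<?> stage;
            try {
                stage = asyncInsert.apply(id);
            } catch (RuntimeException e) {
                // The request could not even be submitted.
                failures.incrementAndGet();
                permits.release();
                continue;
            }
            // Free the permit when the request finishes, successfully or not.
            stage.whenComplete((result, error) -> {
                if (error != null) {
                    failures.incrementAndGet();
                }
                permits.release();
            });
        }

        // Wait for the remaining requests by collecting every permit.
        permits.acquireUninterruptibly(maxInFlight);
        permits.release(maxInFlight);
        return failures.get();
    }
}
```

With the code from the question, the call might look like ThrottledInserts.insertAll(myArrayList, id -> session.executeAsync(insertPrepStmnt.bind(id)), 1024), where 1024 is only an illustrative limit that you would tune against your cluster and connection pool settings. Duplicates in the list need no special handling, because a Cassandra INSERT on an existing primary key simply overwrites the row.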
But if you really want to load data from files, such as CSV or JSON, then I recommend looking at DSBulk. If you just want to generate test data, use NoSQLBench. Both tools are heavily optimized for maximum throughput.
Upvotes: 1