Reputation: 2017
I want to insert around 50 million rows (~30 columns each) into Cassandra. I currently have only 1 node.
I query my data from another data source and store it in a table object. I iterate through it, parse each row individually, then add it to the mutator. Currently I am inserting 100 rows at a time, and 1 million rows takes 40 minutes! How do I speed up this process? (I have also tried client.batch_mutate(), but it seems to hit a connection reset error after a few thousand inserts with a block size of 2.)
From searching around, I see that multi-threading could help, but I could not find any examples. Could someone link me to one? Thank you!
My current code:
List<String> colNames = new ArrayList<String>();
List<String> colValues = new ArrayList<String>();
SomeTable result = Query(...); // this contains my result set of 1M rows initially
for (Iterator itr = result.getRecordIterator(); itr.hasNext();) {
    String colName =.....
    String colValue = .....
    int colCount = colNames.size(); // 100 * 30
    for (int i = 0; i < colCount; i++) {
        // add row keys and columns to mutator
        mutator.addInsertion(String.valueOf(rowCounter), "data", HFactory.createStringColumn(colNames.get(i), colValues.get(i)));
    }
    rowCounter++;
    // insert rows of block size 100
    if (rowCounter % 100 == 0) {
        mutator.execute();
        // clear data
        colNames = new ArrayList<String>();
        colValues = new ArrayList<String>();
        mutator = HFactory.createMutator(keyspace, stringSerializer);
    }
}
Upvotes: 3
Views: 3786
Reputation: 11100
Multithreading will help a lot, yes. At the moment you are using one connection to Cassandra, which means you are only using a single thread inside Cassandra. You need to use multiple connections, which requires multiple threads in your client.
One way would be to use a Java ThreadPoolExecutor: wrap your mutator.execute() in a Runnable and execute it on the thread pool. Take care to handle exceptions. You should also use a bounded BlockingQueue to limit the number of queued mutations, in case you read from your source faster than Cassandra can insert.
With this, set your connection pool size in Hector to something like 10 and your inserts should be significantly faster.
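A minimal sketch of the thread pool idea, using only java.util.concurrent. The class name BoundedBatchWriter and its methods submit and awaitCompletion are made up for illustration and are not part of Hector. Each submitted Runnable stands in for one batch, i.e. a call to mutator.execute() on a mutator built for that batch; I am assuming a mutator should not be shared between threads. The bounded ArrayBlockingQueue caps how many batches can wait, and CallerRunsPolicy is one possible way (my choice, not the only one) to slow the reader down when the queue is full.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical helper: runs insert batches on a fixed pool with a bounded queue.
final class BoundedBatchWriter {
    private final ThreadPoolExecutor pool;
    private final AtomicInteger failures = new AtomicInteger();

    BoundedBatchWriter(int threads, int maxQueuedBatches) {
        // Bounded queue: at most maxQueuedBatches batches wait for a free thread.
        // CallerRunsPolicy: when the queue is full, the submitting thread runs the
        // batch itself, so reading from the source pauses until there is room.
        pool = new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<Runnable>(maxQueuedBatches),
                new ThreadPoolExecutor.CallerRunsPolicy());
    }

    // 'batch' is assumed to wrap one mutator.execute() call.
    void submit(Runnable batch) {
        pool.execute(() -> {
            try {
                batch.run();
            } catch (RuntimeException e) {
                // Count the failure; real code would log and probably retry the batch.
                failures.incrementAndGet();
            }
        });
    }

    // Waits for all submitted batches to finish and returns how many failed.
    int awaitCompletion(long timeout, TimeUnit unit) {
        pool.shutdown();
        try {
            if (!pool.awaitTermination(timeout, unit)) {
                throw new IllegalStateException("batches did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return failures.get();
    }
}
```

In the loop from the question, the idea would be to keep a final reference to the filled mutator every 100 rows, hand it over with something like writer.submit(() -> batchMutator.execute()), and then create a fresh mutator for the next block. The thread count would presumably match the Hector connection pool size so each worker can get its own connection.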
On a side note, in case you weren't aware: Cassandra isn't designed for single-node operation. I assume you intend to scale out and add replication. If not, you will probably find an alternative solution simpler and more performant for your needs. Multiple connections and threads become especially important with multiple nodes, so that your insert rate can scale.
Upvotes: 2