Reputation: 86727
I want to import a few hundred GB of data. Postgres performs fine in general, meaning 1-2 GB / min (using Java).
Selects are also quite fast once I put an index
on the 4 parameters that are used for the select.
Still, it takes a "long" time when importing a few hundred GBs.
Question: could it be worthwhile to try the same using a NoSQL
engine like Apache Cassandra?
Upvotes: 1
Views: 1133
Reputation: 6495
Cassandra builds on query-driven modelling. Since you know your query (and assuming you want equality checks on the four parameters), you should be able to get blistering query speeds if you model the data right.
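To make that concrete, here's a minimal sketch with the DataStax Java driver (3.x API assumed); the keyspace ks, table events, and columns p1..p4 / payload are hypothetical stand-ins for your four parameters:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class SchemaSetup {
        public static void main(String[] args) {
            try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                 Session session = cluster.connect()) {
                session.execute("CREATE KEYSPACE IF NOT EXISTS ks WITH replication = "
                        + "{'class': 'SimpleStrategy', 'replication_factor': 1}");
                // All four lookup parameters go into the primary key, so an equality
                // query on p1..p4 is answered by a single-partition read.
                session.execute("CREATE TABLE IF NOT EXISTS ks.events ("
                        + "p1 text, p2 text, p3 text, p4 text, payload text, "
                        + "PRIMARY KEY ((p1, p2), p3, p4))");
            }
        }
    }

With that layout, SELECT payload FROM ks.events WHERE p1=? AND p2=? AND p3=? AND p4=? hits exactly one partition.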
Cassandra ingestion is also very fast. However, if you've got a lot of data, the usual approach is to transform that data into SSTables (possibly via some code) and bulk-load them (which is extremely fast). If that's not feasible, you can do parallel async writes.
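For the SSTable route, Cassandra ships a CQLSSTableWriter (in the cassandra-all artifact) that writes SSTable files offline, which you then stream in with the sstableloader tool. A rough sketch, reusing the hypothetical ks.events table from above:

    import java.io.File;
    import org.apache.cassandra.io.sstable.CQLSSTableWriter;

    public class SSTableBuild {
        public static void main(String[] args) throws Exception {
            String schema = "CREATE TABLE ks.events ("
                    + "p1 text, p2 text, p3 text, p4 text, payload text, "
                    + "PRIMARY KEY ((p1, p2), p3, p4))";
            String insert = "INSERT INTO ks.events (p1, p2, p3, p4, payload) VALUES (?, ?, ?, ?, ?)";

            File dir = new File("/tmp/sstables/ks/events"); // must exist before writing
            CQLSSTableWriter writer = CQLSSTableWriter.builder()
                    .inDirectory(dir)
                    .forTable(schema)
                    .using(insert)
                    .build();

            // A real importer would stream rows from the source files here.
            writer.addRow("a", "b", "c", "d", "payload-1");
            writer.close();
            // Then bulk-load: sstableloader -d <cassandra-host> /tmp/sstables/ks/events
        }
    }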
COPY is not really meant for large-scale production usage. Either write an importer that uses the Java client to do async writes with futures (a sketch follows the Spark snippet below), or go the SSTable route. Another good alternative is to use Spark and the Spark Cassandra Connector to save CSV RDDs to a Cassandra table. Of course, you'll need a Spark cluster for that to work (though depending on machine power / load you might get away with a single-node Spark standalone process, in which case what you gain is simplicity). The Spark code would look like:
    import com.datastax.spark.connector._  // provides saveToCassandra on RDDs

    sc.textFile("csv.csv").map(_.split(",")).[...transforms...].saveToCassandra("ks", "table")
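And for the async-writes route mentioned above, a minimal sketch with the DataStax Java driver (3.x API assumed); the 256 in-flight cap and the rowsFromSomewhere() helper are made-up placeholders to tune / replace for your data source:

    import java.util.ArrayList;
    import java.util.List;
    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.PreparedStatement;
    import com.datastax.driver.core.ResultSetFuture;
    import com.datastax.driver.core.Session;

    public class AsyncImporter {
        public static void main(String[] args) {
            try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                 Session session = cluster.connect("ks")) {
                PreparedStatement ps = session.prepare(
                        "INSERT INTO events (p1, p2, p3, p4, payload) VALUES (?, ?, ?, ?, ?)");

                List<ResultSetFuture> inFlight = new ArrayList<>();
                for (String[] row : rowsFromSomewhere()) {       // hypothetical parsed-row source
                    inFlight.add(session.executeAsync(ps.bind((Object[]) row)));
                    if (inFlight.size() >= 256) {                // bound the in-flight requests
                        inFlight.forEach(ResultSetFuture::getUninterruptibly);
                        inFlight.clear();
                    }
                }
                inFlight.forEach(ResultSetFuture::getUninterruptibly); // drain the tail
            }
        }

        // Placeholder: replace with your CSV parsing.
        private static Iterable<String[]> rowsFromSomewhere() {
            return new ArrayList<String[]>();
        }
    }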
Upvotes: 1