Barry Kelly

Reputation: 42152

How should large SELECTs be done in Cassandra?

I'm investigating Cassandra as a possible alternative backing store for a data-intensive application, and I'm looking at ways to structure the schema and use CQL to do the kinds of queries we do today using MySQL.

The concrete problem I currently have is: I need to insert, say, 1 million rows into a table. However, if a row with the same identity (identified by a hash) already exists in the system, I want to reuse its id for relational reasons. I only expect an overlap of around 10,000 IDs, but of course it could be all 1 million.

Suppose I have a table like this:

create table records_by_hash(hash text primary key, id bigint);

Is it enough to issue a select hash, id from records_by_hash where hash in (...) with all hashes in a multi-megabyte comma-separated list? Is this the best approach for Cassandra?

The way we do this in MySQL is like this:

create temporary table hashes(hash text);
-- file is actually a JDBC OutputStream
load data infile '/dev/stdin' into table hashes; -- csv format
select id, hash from records join hashes on records.hash = hashes.hash;

Since records is indexed on hash, and the lookup data is now in MySQL (no more round trips), this is fairly quick and painless. load data is very fast, and there are only three logical round trips.

Upvotes: 0

Views: 255

Answers (2)

Shlomi Livne

Reputation: 477

In MySQL the ID is usually an AUTO_INCREMENT column; there is no parallel for this in Cassandra. It's not clear to me whether you want Cassandra to create the IDs as well, or have some other system/DB create them for you.

Another thing to note is that MySQL's INSERT INTO table (a,b,c) VALUES (1,2,3) ON DUPLICATE KEY UPDATE is the parallel of Cassandra's CQL INSERT; that is, a CQL INSERT will update the record if one already exists.
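A minimal sketch of that upsert behavior, using the questioner's records_by_hash table (the values are illustrative):

```sql
-- Cassandra treats INSERT as an upsert: a second INSERT with the
-- same primary key silently overwrites the existing row.
INSERT INTO records_by_hash (hash, id) VALUES ('abc123', 1);
INSERT INTO records_by_hash (hash, id) VALUES ('abc123', 2);
-- SELECT id FROM records_by_hash WHERE hash = 'abc123' now returns 2;
-- no error or "duplicate key" signal is raised.
```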

You may want to model the information in a different manner in Cassandra.

Upvotes: 0

Marko Švaljek

Reputation: 2101

Using the IN operator is usually not the best idea, because you are hitting multiple partitions (located on random nodes) within the same query. It is slow and puts a lot of work on the coordinator node. A multi-megabyte list there is not a good idea.
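The usual alternative to a huge IN list is one single-partition SELECT per hash, prepared once and executed (ideally asynchronously, via your driver) for each key, so each query is routed straight to the node that owns that partition:

```sql
-- Prepare this once; bind and execute it per hash instead of
-- sending one multi-megabyte IN (...) query to a single coordinator.
SELECT hash, id FROM records_by_hash WHERE hash = ?;
```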

Check-before-set is rarely a good idea because it doesn't really scale. Also, Cassandra does not provide joins. Depending on your needs, you would have to have some sort of script that checks all this before doing the inserts, so you would need check-and-set logic.

An alternative approach for this would be to use Spark.

The thing is, Cassandra won't mind if the hash is already there and you insert new stuff over it. But that is not what you actually need, because you want to keep the references. One possible approach is to use lightweight transactions: with IF NOT EXISTS the insertion is performed only if the row does not already exist. Note that IF NOT EXISTS incurs a performance hit, because it uses Paxos internally.
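A sketch of the lightweight-transaction form, again against the records_by_hash table from the question (the values are illustrative):

```sql
-- Conditional insert: only applied if no row with this hash exists.
-- The result set contains an [applied] column; when it is false,
-- the existing row's columns are returned, so the old id can be
-- reused without a separate SELECT.
INSERT INTO records_by_hash (hash, id)
VALUES ('abc123', 42)
IF NOT EXISTS;
```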

Upvotes: 0
