Proliges

Reputation: 361

Check whether rows already exist when loading a lot of data

I receive CSV files, read them, and write their contents to Cassandra. I do this for a lot of data (roughly 10 million lines per day). The files themselves are fairly small (100 to 1000 lines each).

Before writing to the database, I want to check whether the primary key I'm about to insert already exists.

I know I can do it with SELECT count(*) FROM table WHERE key1 = something AND key2 = something else.

But this is slow. I want to check an entire file in one go to see whether it is going to affect data that is already in Cassandra, and I need it to be fast. Is there a way to achieve this? (Or something similar, such as checking per batch whether it is going to affect existing rows.)

Upvotes: 0

Views: 1577

Answers (1)

Citrullin

Reputation: 2321

You can use IF NOT EXISTS in INSERT statements and IF EXISTS in UPDATE statements. This performs better than counting rows, but it is slow compared to a plain insert without the check, because Cassandra has to coordinate with the replicas for that key to find out whether the primary key already exists.

Documentation for INSERT: https://docs.datastax.com/en/cql/3.1/cql/cql_reference/insert_r.html

and for UPDATE: https://docs.datastax.com/en/cql/3.1/cql/cql_reference/update_r.html
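As a rough sketch of the conditional insert, assuming the DataStax Python driver (cassandra-driver) and a hypothetical table named readings with columns key1, key2 and value (none of these names come from the question), something like this would insert only the rows whose primary key is not yet present and report the ones that were skipped. The session argument is assumed to be an already connected driver session.

```python
# Hypothetical table and column names; adjust to your own schema.
INSERT_IF_NOT_EXISTS = (
    "INSERT INTO readings (key1, key2, value) "
    "VALUES (%s, %s, %s) IF NOT EXISTS"
)


def insert_new_rows(session, rows):
    """Insert each (key1, key2, value) row only if its key is absent.

    Returns the (key1, key2) pairs that already existed and were skipped.
    """
    skipped = []
    for key1, key2, value in rows:
        result = session.execute(INSERT_IF_NOT_EXISTS, (key1, key2, value))
        # was_applied is False when the conditional insert found an existing row.
        if not result.was_applied:
            skipped.append((key1, key2))
    return skipped
```

Each call is still one conditional write per row, so this does not remove the cost described above; it only avoids a separate count query before every insert.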

Upvotes: 1
