Cassandra is missing data when loading a csv with cassandra-loader

Question

I use Cassandra 3.11.3 with two nodes on Ubuntu 16.04. The keyspace and table I will use here are:

-- Create a keyspace
CREATE KEYSPACE sto
WITH REPLICATION = { 
'class' : 'SimpleStrategy', 
'replication_factor' : 1 
} ;
-- Create a table
CREATE TABLE sto.cartespuce_numligne_date (
numcarteserie text,
codetypetitre int,
typetransaction int,
heuretransaction float,
numservice int,
numligne text,
direction text,
heureligne float,
numjour text,
numarret text,
numbus int,
date date,
PRIMARY KEY (numligne, date) 
) WITH CLUSTERING ORDER BY (date DESC);

I upload a small dataset of 50,000 rows to this table

numligne,date,codetypetitre,direction,heureligne,heuretransaction,numarret,numbus,numcarteserie,numjour,numservice,typetransaction
33,2017-12-07,144,Nord,13.88,15.27,2190,808,1229320749340288,1,268,2
749,2017-12-08,144,Nord,6.93,7.35,1459,507,1229320749340288,1,548,1

using cassandra-loader https://github.com/brianmhess/cassandra-loader

I could use the CQL COPY command, but this is a preliminary test for later loads where I will need cassandra-loader.

I load the csv file data.csv:

cassandra-loader -f data.csv -host my-ip-address -schema "sto.cartespuce_numligne_date(numligne,date,codetypetitre,direction,heureligne,heuretransaction,numarret,numbus,numcarteserie,numjour,numservice,typetransaction)"

The load runs smoothly and ends with the following log:

*** DONE: data.csv  number of lines processed: 50000 (50000 inserted)

But when I count the rows with CQL:

cqlsh> SELECT COUNT(*) FROM sto.cartespuce_numligne_date;

count
-------
9877

Comparing particular cases, it is clear that data is missing from the database. I see no pattern that distinguishes the rows that were stored from the rows that are missing.

How can I lose 80% of my data?

Horia · Accepted Answer

The primary key of your table is (numligne, date).

Because the rows in your CSV file are not unique with respect to that primary key, Cassandra overwrites the existing row whenever an insert arrives with a key that is already present. The 9,877 rows you count would then be the number of distinct (numligne, date) pairs in the file.

For example, if line 43 of the file contains the combination 33,2017-12-07,... it is inserted. If line 2000 contains the same combination, Cassandra performs an update when that insert runs, because the key is already in the database.

Both INSERT and UPDATE are upsert operations. See the Cassandra documentation for the INSERT and UPDATE commands for further reading.
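One way to check this explanation before reloading anything is to count the distinct (numligne, date) pairs in the CSV. The sketch below is only an illustration: the function name count_primary_keys and the three sample rows are made up for the example, and it assumes the file starts with the header row shown in the question.

```python
import csv
from collections import Counter


def count_primary_keys(lines, key_columns=("numligne", "date")):
    """Count rows per primary-key value in CSV text that starts with a header row.

    `lines` is any iterable of CSV lines, for example an open file object.
    """
    counts = Counter()
    for row in csv.DictReader(lines):
        counts[tuple(row[col] for col in key_columns)] += 1
    return counts


# Made-up sample: three rows, but only two distinct (numligne, date) pairs.
sample = [
    "numligne,date,numbus",
    "33,2017-12-07,808",
    "749,2017-12-08,507",
    "33,2017-12-07,911",
]

counts = count_primary_keys(sample)
total_rows = sum(counts.values())   # rows the loader would report as inserted
distinct_keys = len(counts)         # rows Cassandra would actually keep
print(total_rows, "rows,", distinct_keys, "distinct keys")
```

To run it on the real file, pass an open file instead of the sample, for example count_primary_keys(open("data.csv", newline="")). If duplicate keys are the cause, the number of distinct keys should match the result of SELECT COUNT(*).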

To avoid this, you could define a different primary key so that every line has a unique key. Alternatively, you could write your own loader that inserts with IF NOT EXISTS, so that a row is inserted only if its key does not already exist (see the section "Inserting a row only if it does not already exist" in the INSERT documentation). Note that the second option keeps the first row for each key instead of the last one; it does not store the duplicates.
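Before changing the schema, it can help to test whether a candidate set of columns is actually unique in the file. The following is a minimal sketch under assumptions: the helper is_unique_key is hypothetical, the sample rows are invented, and the candidate key (numligne, date, numcarteserie, heuretransaction) is only a guess that would need to be checked against the real data.

```python
import csv


def is_unique_key(lines, key_columns):
    """Return True if no two CSV rows share the same values for key_columns."""
    seen = set()
    for row in csv.DictReader(lines):
        key = tuple(row[col] for col in key_columns)
        if key in seen:
            return False  # a second row with this key would overwrite the first
        seen.add(key)
    return True


# Invented sample: the same line and date, used by two cards at different times.
sample = [
    "numligne,date,heuretransaction,numcarteserie",
    "33,2017-12-07,15.27,1229320749340288",
    "33,2017-12-07,16.02,1229320749340288",
    "33,2017-12-07,15.27,9999999999999999",
]

current_key_ok = is_unique_key(sample, ("numligne", "date"))
candidate_key_ok = is_unique_key(
    sample, ("numligne", "date", "numcarteserie", "heuretransaction")
)
print("current key unique:", current_key_ok)
print("candidate key unique:", candidate_key_ok)
```

If no combination of existing columns turns out to be unique, another option is to add a generated identifier column to the table and include it in the primary key.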

Cassandra provides its own COPY command, but it behaves the same way. Its documentation states:

"The process verifies the PRIMARY KEY and updates existing records."

After checking the code of the tool you are using, I can see that its INSERT statement does not use IF NOT EXISTS, so it also updates the row if the key already exists.
