Reputation: 93
I have an RDD with more than 75 million rows, and when I call the count
function on it, I get a different number every time. My understanding was that count is supposed to return the exact number of rows.
Edit
Just to give an idea of the data, the structure is something like this:
Userid: 1
Date: 8/15/2015
Location: Building 1
...
Date: 8/1/2015
Location: Building 5
...
Userid: 2
Date: 7/30/2015
Location: Building 10
...
Date: 6/1/2015
Location: Building 3
...
Partition key: Userid
Clustering key: Date
ORDER BY DESC
Spark version: 1.2.2
Data is from Cassandra
API used is Scala
Spark Cassandra connector version 1.2.2
I have just read the data from Cassandra and used map to get just the Userid and Location, as in the sketch below.
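A minimal sketch of that read path, assuming hypothetical keyspace, table, and column types (my_keyspace, user_events), since the real names aren't in the question:

    import org.apache.spark.{SparkConf, SparkContext}
    import com.datastax.spark.connector._

    val conf = new SparkConf().setAppName("CountCheck")
    val sc = new SparkContext(conf)

    // Keyspace/table names and column types are placeholders.
    val rdd = sc.cassandraTable("my_keyspace", "user_events")
      .map(row => (row.getInt("userid"), row.getString("location")))

    // Symptom reported above: repeated runs print different totals
    // even though the table isn't being written to.
    println(rdd.count())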
Upvotes: 3
Views: 925
Reputation: 93
I was reading at consistency level LOCAL_ONE, and switching to QUORUM consistency resolved the issue. The underlying cause was a high mutation drop count on one of our nodes: that replica was missing rows, so a single-replica read could return an incomplete (and therefore varying) result depending on which replica served it.
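For reference, a minimal sketch of raising the connector's read consistency via the spark.cassandra.input.consistency.level property (whose documented default is LOCAL_ONE):

    import org.apache.spark.SparkConf

    // Ask the connector to read at QUORUM instead of the default
    // LOCAL_ONE, so each read touches a majority of replicas.
    val conf = new SparkConf()
      .setAppName("CountCheck")
      .set("spark.cassandra.input.consistency.level", "QUORUM")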
Upvotes: 2