Reputation: 93
I have an RDD with more than 75 million rows, and when I call the count
function on it, I get a different number every time. My understanding was that count is supposed to return the exact number of rows.
Edit
Just to give an idea of the data, the structure is something like this:
Userid: 1
Date: 8/15/2015
Location: Building 1
...
Date: 8/1/2015
Location: Building 5
...
Userid: 2
Date: 7/30/2015
Location: Building 10
...
Date: 6/1/2015
Location: Building 3
...
Partition key: Userid
Clustering key: Date
ORDER BY DESC
Spark version: 1.2.2
Data is from Cassandra
API used is Scala
Spark Cassandra connector version 1.2.2
I have just read the data from Cassandra and used map to get just the Userid and Location, as in the sketch below.
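A minimal sketch of that read path, assuming hypothetical keyspace, table, and column types (my_keyspace, user_events), since the real names aren't in the question:

    import org.apache.spark.{SparkConf, SparkContext}
    import com.datastax.spark.connector._

    val conf = new SparkConf().setAppName("CountCheck")
    val sc = new SparkContext(conf)

    // Keyspace/table names and column types are placeholders.
    val rdd = sc.cassandraTable("my_keyspace", "user_events")
      .map(row => (row.getInt("userid"), row.getString("location")))

    // Symptom reported above: repeated runs print different totals
    // even though the table isn't being written to.
    println(rdd.count())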
Upvotes: 3
Views: 925
Reputation: 93
I was reading at consistency level LOCAL_ONE, and switching to QUORUM consistency resolved the issue. The underlying cause was a high mutation drop count on one of our nodes: that replica was missing rows, so a single-replica read could return an incomplete (and therefore varying) result depending on which replica served it.
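For reference, a minimal sketch of raising the connector's read consistency via the spark.cassandra.input.consistency.level property (whose documented default is LOCAL_ONE):

    import org.apache.spark.SparkConf

    // Ask the connector to read at QUORUM instead of the default
    // LOCAL_ONE, so each read touches a majority of replicas.
    val conf = new SparkConf()
      .setAppName("CountCheck")
      .set("spark.cassandra.input.consistency.level", "QUORUM")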
Upvotes: 2