Reputation: 2571
I'm new to Spark and I'm trying to run a distinct().count() based on some fields of a CSV file.
CSV structure (without header):
id,country,type
01,AU,s1
02,AU,s2
03,GR,s2
03,GR,s2
To load the .csv file I typed:
lines = sc.textFile("test.txt")
Then a distinct count on the lines returned 3 as expected:
lines.distinct().count()
But I have no idea how to run a distinct count based on, let's say, id and country.
Upvotes: 7
Views: 26643
Reputation: 39
The line-splitting step can be optimized by slicing off the last field (wrapped in tuple() so that distinct() can hash the records):
sc.textFile("test.txt").map(lambda line: tuple(line.split(",")[:-1])).distinct().count()
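The slicing logic can be checked without a Spark cluster, using plain Python on the sample rows from the question (tuple() makes the pairs hashable so they can go into a set, mirroring what distinct() needs):

```python
# Plain-Python check of the per-line transformation, no Spark needed.
rows = ["01,AU,s1", "02,AU,s2", "03,GR,s2", "03,GR,s2"]

# line.split(",")[:-1] drops the trailing "type" column,
# leaving only (id, country).
pairs = [tuple(line.split(",")[:-1]) for line in rows]

distinct_count = len(set(pairs))
print(distinct_count)  # 3 distinct (id, country) pairs
```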
Upvotes: 2
Reputation: 40993
In this case you would select the columns you want to consider, and then count:
sc.textFile("test.txt")\
.map(lambda line: (line.split(',')[0], line.split(',')[1]))\
.distinct()\
.count()
This is written for clarity; you can optimize the lambda to avoid calling line.split twice.
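One way to split each line only once is to pull the transformation into a named function; a sketch in plain Python (so it runs without Spark), using the sample rows from the question:

```python
rows = ["01,AU,s1", "02,AU,s2", "03,GR,s2", "03,GR,s2"]

def key(line):
    # Split once, keep the first two fields as a hashable tuple.
    parts = line.split(",")
    return (parts[0], parts[1])

print(len({key(line) for line in rows}))  # 3
```

On the RDD the same function plugs in as `sc.textFile("test.txt").map(key).distinct().count()`.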
Upvotes: 8