dimzak

Reputation: 2571

PySpark distinct().count() on a csv file

I'm new to spark and I'm trying to make a distinct().count() based on some fields of a csv file.

Csv structure(without header):

id,country,type
01,AU,s1
02,AU,s2
03,GR,s2
03,GR,s2

to load .csv I typed:

lines = sc.textFile("test.txt")

then a distinct count on lines returned 3 as expected:

lines.distinct().count()

But I have no idea how to make a distinct count based on, let's say, id and country.

Upvotes: 7

Views: 26643

Answers (2)

rami

Reputation: 39

The split can be trimmed to a single expression: keep every field except the last, convert the result to a tuple (distinct() needs hashable elements, and Python lists are not hashable), then count:

sc.textFile("test.txt").map(lambda line: tuple(line.split(",")[:-1])).distinct().count()
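Outside Spark, the slicing itself can be sanity-checked on one of the sample rows with plain Python (no SparkContext needed):

```python
# Drop the last field ("type") and keep (id, country) as a hashable tuple
line = "03,GR,s2"
print(tuple(line.split(",")[:-1]))  # → ('03', 'GR')
```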

Upvotes: 2

elyase

Reputation: 40993

In this case you would select the columns you want to consider, and then count:

sc.textFile("test.txt")\
  .map(lambda line: (line.split(',')[0], line.split(',')[1]))\
  .distinct()\
  .count()

This version is written for clarity; you can optimize the lambda to avoid calling line.split twice.
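One way to split only once is to move the work into a small helper (a sketch; `id_country` is a name I'm introducing, not part of the original answer). The distinct count it produces can be checked locally by simulating `.distinct().count()` with a Python set over the sample rows:

```python
# Split once and keep only (id, country); a tuple is hashable,
# which both distinct() and set membership require
def id_country(line):
    fields = line.split(",")
    return (fields[0], fields[1])

rows = ["01,AU,s1", "02,AU,s2", "03,GR,s2", "03,GR,s2"]

# A set of the mapped tuples mirrors .map(id_country).distinct().count()
print(len({id_country(r) for r in rows}))  # → 3
```

In Spark this would be `sc.textFile("test.txt").map(id_country).distinct().count()`, with `line.split` called once per row instead of twice.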

Upvotes: 8