Tommy the Cat

Reputation: 11

Comparing CSV files with PySpark

I'm brand new to PySpark, but I need to dig into it very fast. I want to compare two (huge) CSV files in PySpark and have managed quite okay so far (I'm pretty sure my code is far from fancy). In the end I'd like to count the records which match and those which don't.

What I was able to achieve is:

1. Loading the CSV files into RDDs:
act="actual.csv"
exp="expected.csv"
raw_exp = sc.textFile(exp)                                                  
raw_act = sc.textFile(act)
2. I can count the number of records by using .count():
print "Expected: ", raw_exp.count()
print "Actual:", raw_act.count()
3. I can compare the RDDs by using subtract and collect to get the records which don't match (see the sketch after this list for counting the matching records as well):
notCompRecords  = raw_exp.subtract(raw_act).collect()
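For completeness, a minimal sketch of counting the matching records too, building on the RDDs above. The intersection() call and the variable name matching are additions for illustration, not part of the original steps; note that intersection() removes duplicate lines, so the count can differ for files with repeated rows:

matching = raw_exp.intersection(raw_act)   # lines present in both files
print "Matching:", matching.count()        # RDD.count() takes no arguments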

Now I want to count those records which don't match. I thought I would use:

notCompRecords.count()

but I got an error saying that an argument is missing:

TypeError: count() takes at least 1 argument (0 given)

I also learned that I have to convert the list, which notCompRecords obviously is, into a string by:

notCompString   = ''.join(notCompRecords) 

but this also doesn't work.
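For context: sc.textFile() strips the trailing newline from every line, so joining the collected list with '' produces one unbroken string in which the original line boundaries are gone. A plain-Python illustration:

# joining loses the line boundaries entirely, so the resulting
# string cannot be split back into lines for counting:
''.join(['line1', 'line2'])        # -> 'line1line2'
len(''.join(['line1', 'line2']))   # -> 10 (characters, not lines)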

How can I count the lines in the object/variable/RDD notCompRecords?

Thanks! Any hint or clue is appreciated. Best Regards,

Upvotes: 1

Views: 1828

Answers (1)

jho

Reputation: 775

Remove the .collect() from notCompRecords = raw_exp.subtract(raw_act).collect(). The .collect() call brings the RDD's contents back to the driver as a plain Python list, and list.count() expects the value whose occurrences it should count, which is why you got that TypeError. Without the .collect(), notCompRecords stays an RDD, and RDD.count() takes no arguments, so notCompRecords.count() works as expected.
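A minimal sketch of the corrected flow, using the variable names from the question (the len() variant is an alternative for when the records are needed on the driver anyway):

notCompRecords = raw_exp.subtract(raw_act)      # still an RDD, not a list
print "Not matching:", notCompRecords.count()   # RDD.count() takes no arguments

# Alternative: collect first, then take the length of the Python list
notCompList = raw_exp.subtract(raw_act).collect()
print "Not matching:", len(notCompList)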

Upvotes: 3
