piterd

Reputation: 117

pyspark - multiple input files into one RDD and one output file

I have a word count in Python that I want to run on Spark with multiple text files and get ONE output file, so that the words are counted across all the files together. I tried a few solutions, for example the ones found here and here, but it still gives the same number of output files as input files.

rdd = sc.textFile("file:///path/*.txt")
input = sc.textFile(join(rdd))

or

rdd = sc.textFile("file:///path/f0.txt,file:///path/f1.txt,...")
rdds = Seq(rdd)
input = sc.textFile(','.join(rdds))

or

rdd = sc.textFile("file:///path/*.txt")
input = sc.union(rdd)

don't work. Can anybody suggest how to make one RDD out of several input text files?

Thanks in advance...

Upvotes: 5

Views: 8089

Answers (1)

Mohitt

Reputation: 2977

This should load all the files matching the pattern.

rdd = sc.textFile("file:///path/*.txt")

Now you do not need to do any union: the glob pattern already gives you a single RDD.
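For completeness, if you had loaded the files as separate RDDs, sc.union takes a list of RDDs, not a single RDD (a minimal sketch, using placeholder paths):

# Hypothetical alternative: load the files separately, then union them.
# sc.union expects a list of RDDs.
rdd1 = sc.textFile("file:///path/f0.txt")
rdd2 = sc.textFile("file:///path/f1.txt")
combined = sc.union([rdd1, rdd2])  # one RDD covering both files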

Coming to your question of why you get many output files: the number of output files depends on the number of partitions in the RDD. When you run the word count logic, the resulting RDD can have more than one partition. If you want to save the RDD as a single file, use coalesce or repartition so that it has only one partition.
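A quick way to see this is to check the partition count; saveAsTextFile writes one part-* file per partition (a short sketch, reusing the same placeholder path):

rdd = sc.textFile("file:///path/*.txt")
# One or more partitions per input file/split; each becomes a part-* file on save.
print(rdd.getNumPartitions())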

The code below works; it is taken from the Spark examples.

# Split each line into words, pair each word with 1, and sum the counts per word
rdd = sc.textFile("file:///path/*.txt")
counts = rdd.flatMap(lambda line: line.split(" ")) \
            .map(lambda word: (word, 1)) \
            .reduceByKey(lambda a, b: a + b)

# coalesce(1) collapses the result to one partition, hence one output file
counts.coalesce(1).saveAsTextFile("res.csv")
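One caveat: saveAsTextFile("res.csv") creates a directory named res.csv containing a single part-00000 file (plus a _SUCCESS marker), not a flat CSV file. You can read it back the same way it was written:

# The output path is a directory of part files; textFile reads them back.
check = sc.textFile("res.csv")
print(check.take(5))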

Upvotes: 9
