Reputation: 2050
I have many files in a directory, each file containing text spanning multiple lines. Currently, I use the following code to read all those files into a Spark Dataset (Spark 2.0+):
val ddf = spark.read.text("file:///input/*")
However, this creates a dataset where each row is a line, not a file. I'd like each file (as a string) to be one row in the dataset.
How can I achieve this without iterating over each file and reading each one in separately as an RDD?
Upvotes: 2
Views: 2841
Reputation: 5572
An alternative to @mrsrinivas's answer would be to group by input_file_name. Given the structure:
evan@vbox>~/junk/so> find .
.
./d2
./d2/t.txt
./d1
./d1/t.txt
evan@vbox>~/junk/so> cat */*.txt
d1_1
d1_2
d2_1
d2_2
We can collect lists based on the input files like so:
scala> val ddf = spark.read.textFile("file:///home/evan/junk/so/*").
| select($"value", input_file_name as "fName")
ddf: org.apache.spark.sql.DataFrame = [value: string, fName: string]
scala> ddf.show(false)
+-----+----------------------------------+
|value|fName |
+-----+----------------------------------+
|d2_1 |file:///home/evan/junk/so/d2/t.txt|
|d2_2 |file:///home/evan/junk/so/d2/t.txt|
|d1_1 |file:///home/evan/junk/so/d1/t.txt|
|d1_2 |file:///home/evan/junk/so/d1/t.txt|
+-----+----------------------------------+
scala> ddf.groupBy("fName").agg(collect_list($"value") as "value").
| drop("fName").show
+------------+
| value|
+------------+
|[d1_1, d1_2]|
|[d2_1, d2_2]|
+------------+
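Since the question asks for each file as a single string rather than an array of lines, the collected lines can be joined back together with concat_ws. A minimal sketch, assuming the same ddf as above (perFile and "content" are placeholder names; note that the order of elements produced by collect_list is not guaranteed):
import spark.implicits._
import org.apache.spark.sql.functions.{collect_list, concat_ws}

// One row per file, with all of that file's lines joined into a single string
val perFile = ddf
  .groupBy("fName")
  .agg(concat_ws("\n", collect_list($"value")) as "content")

perFile.show(false)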
Upvotes: 2
Reputation: 35434
Use wholeTextFiles() on SparkContext:
import org.apache.spark.rdd.RDD

val rdd: RDD[(String, String)] = spark.sparkContext
  .wholeTextFiles("file/path/to/read/as/rdd")
SparkContext.wholeTextFiles lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs. This is in contrast with textFile, which would return one record per line in each file.
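If you want a DataFrame/Dataset rather than an RDD, the (filename, content) pairs can be converted directly. A minimal sketch, assuming a SparkSession named spark and placeholder column names ("fileName", "content"):
import spark.implicits._

// Each row holds one whole file: its path and its full contents as a string
val filesDF = spark.sparkContext
  .wholeTextFiles("file:///input/*")
  .toDF("fileName", "content")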
Upvotes: 6