Tim
Tim

Reputation: 2050

Spark each file to a dataset row

I have many files in a directory, each file containing text spanning multiple lines. Currently, I use the following code to read all those files to a spark dataset (>2.0)

   val ddf = spark.read.text("file:///input/*")

However, this creates a dataset where each row is a line, not a file. I'd like to have each file (as string) per row in the dataset.

How can I achieve this without iterating over each file and reading it in separately as RDD?

Upvotes: 2

Views: 2841

Answers (2)

evan.oman
evan.oman

Reputation: 5572

An alternative to @mrsrinivas's answer would be to group by input_file_name. Given the structure:

evan@vbox>~/junk/so> find .        
.
./d2
./d2/t.txt
./d1
./d1/t.txt
 evan@vbox>~/junk/so> cat  */*.txt
d1_1
d1_2
d2_1
d2_2

We can collect lists based on the input files like so:

scala> val ddf = spark.read.textFile("file:///home/evan/junk/so/*").
     | select($"value", input_file_name as "fName")
ddf: org.apache.spark.sql.DataFrame = [value: string, fName: string]

scala> ddf.show(false)
+-----+----------------------------------+
|value|fName                             |
+-----+----------------------------------+
|d2_1 |file:///home/evan/junk/so/d2/t.txt|
|d2_2 |file:///home/evan/junk/so/d2/t.txt|
|d1_1 |file:///home/evan/junk/so/d1/t.txt|
|d1_2 |file:///home/evan/junk/so/d1/t.txt|
+-----+----------------------------------+

scala> ddf.groupBy("fName").agg(collect_list($"value") as "value").
     | drop("fName").show
+------------+
|       value|
+------------+
|[d1_1, d1_2]|
|[d2_1, d2_2]|
+------------+

Upvotes: 2

mrsrinivas
mrsrinivas

Reputation: 35434

Use wholeTextFiles() on SparkContext

val rdd: RDD[(String, String)] = spark.sparkContext
                                      .wholeTextFiles("file/path/to/read/as/rdd")

SparkContext.wholeTextFiles lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs. This is in contrast with textFile, which would return one record per line in each file.

Upvotes: 6

Related Questions