Reputation: 2050
I have many files in a directory, each file containing text spanning multiple lines. Currently, I use the following code to read all those files into a Spark Dataset (Spark 2.0+):
val ddf = spark.read.text("file:///input/*")
However, this creates a dataset where each row is a line, not a file. I'd like each file (as a string) to be one row in the dataset.
How can I achieve this without iterating over each file and reading each one in separately as an RDD?
Upvotes: 2
Views: 2841
Reputation: 5572
An alternative to @mrsrinivas's answer would be to group by input_file_name. Given the structure:
evan@vbox>~/junk/so> find .
.
./d2
./d2/t.txt
./d1
./d1/t.txt
evan@vbox>~/junk/so> cat */*.txt
d1_1
d1_2
d2_1
d2_2
We can collect lists based on the input files like so:
scala> val ddf = spark.read.textFile("file:///home/evan/junk/so/*").
| select($"value", input_file_name as "fName")
ddf: org.apache.spark.sql.DataFrame = [value: string, fName: string]
scala> ddf.show(false)
+-----+----------------------------------+
|value|fName |
+-----+----------------------------------+
|d2_1 |file:///home/evan/junk/so/d2/t.txt|
|d2_2 |file:///home/evan/junk/so/d2/t.txt|
|d1_1 |file:///home/evan/junk/so/d1/t.txt|
|d1_2 |file:///home/evan/junk/so/d1/t.txt|
+-----+----------------------------------+
scala> ddf.groupBy("fName").agg(collect_list($"value") as "value").
| drop("fName").show
+------------+
| value|
+------------+
|[d1_1, d1_2]|
|[d2_1, d2_2]|
+------------+
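Since the question asks for each file as a single string rather than an array of lines, the collected lines can be joined back together with concat_ws. A minimal sketch, assuming the same ddf as above (perFile and "content" are placeholder names; note that the order of elements produced by collect_list is not guaranteed):
import spark.implicits._
import org.apache.spark.sql.functions.{collect_list, concat_ws}

// One row per file, with all of that file's lines joined into a single string
val perFile = ddf
  .groupBy("fName")
  .agg(concat_ws("\n", collect_list($"value")) as "content")

perFile.show(false)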
Upvotes: 2
Reputation: 35434
Use wholeTextFiles() on SparkContext:
import org.apache.spark.rdd.RDD

val rdd: RDD[(String, String)] = spark.sparkContext
  .wholeTextFiles("file/path/to/read/as/rdd")
SparkContext.wholeTextFiles lets you read a directory containing multiple small text files, and returns each of them as (filename, content) pairs. This is in contrast with textFile, which would return one record per line in each file.
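If you want a DataFrame/Dataset rather than an RDD, the (filename, content) pairs can be converted directly. A minimal sketch, assuming a SparkSession named spark and placeholder column names ("fileName", "content"):
import spark.implicits._

// Each row holds one whole file: its path and its full contents as a string
val filesDF = spark.sparkContext
  .wholeTextFiles("file:///input/*")
  .toDF("fileName", "content")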
Upvotes: 6