Reputation: 3830
I want to read all files in a nested directory and perform some transformation on each of them. However, I also need some information from the actual path of each file. This is my current approach:
sdf = spark.read.text(path)
sdf.show()
+--------------------+
|               value|
+--------------------+
|                 ...|
|                 ...|
+--------------------+
However, I want something like:
sdf = spark.read.text(path, include_file_paths=True)
sdf.show()
+--------------------+---------+
|               value|     path|
+--------------------+---------+
|                 ...| /a/b.txt|
|                 ...| /c/d.txt|
+--------------------+---------+
This seems like something that should be possible, but I cannot find any resources describing it. What am I missing?
Upvotes: 1
Views: 2471
Reputation: 2518
You can use the input_file_name built-in function from pyspark.sql.functions as follows:
sdf.withColumn("path", input_file_name())
This built-in function is evaluated at the task level, so each row receives the path of the file it was read from.
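For completeness, here is a minimal end-to-end sketch; the directory layout and glob pattern /data/nested/*/*.txt are hypothetical placeholders:
from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name

spark = SparkSession.builder.getOrCreate()

# Read every text file one level down in the (hypothetical) nested directory.
sdf = spark.read.text("/data/nested/*/*.txt")

# input_file_name() is evaluated per task, so each row is tagged
# with the full path of the file it was actually read from.
sdf = sdf.withColumn("path", input_file_name())
sdf.show(truncate=False)
From there you can apply your transformation per row and, if needed, extract parts of the path column with the usual string functions.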
Upvotes: 3