casparjespersen

Reputation: 3830

Spark read folder directory with file names included in resulting data frame

I want to read all files in a nested directory and perform some transformation on each of them. However, I also need some information from the actual path of each file. This is what I have so far:

sdf = spark.read.text(path)
sdf.show()
+--------------------+
|               value|
+--------------------+
|                 ...|
|                 ...|
+--------------------+

However, I want something like:

sdf = spark.read.text(path, include_file_paths=True)
sdf.show()
+--------------------+---------+
|               value|     path|
+--------------------+---------+
|                 ...| /a/b.txt|
|                 ...| /c/d.txt|
+--------------------+---------+

This seems like something that should be possible, but I cannot find any resources describing it. What am I missing?

Upvotes: 1

Views: 2471

Answers (1)

baitmbarek

Reputation: 2518

You can use the input_file_name built-in function as follows:

from pyspark.sql.functions import input_file_name

sdf = sdf.withColumn("path", input_file_name())

This built-in function is evaluated at the task level, so each row is tagged with the path of the file it was read from.
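For a complete picture, here is a minimal end-to-end sketch; the /data directory and the */*.txt glob are assumptions for illustration:

from pyspark.sql import SparkSession
from pyspark.sql.functions import input_file_name

spark = SparkSession.builder.getOrCreate()

# Read all text files one level down inside the (hypothetical) nested directory.
sdf = spark.read.text("/data/*/*.txt")

# Each row gets the full path of the source file it came from.
sdf = sdf.withColumn("path", input_file_name())

sdf.show()

This produces the value/path layout from the question without needing an include_file_paths option.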

Upvotes: 3
