Reputation: 13
I want to store the value returned by input_file_name() in a variable instead of in a DataFrame column. This variable will then be used for logging, troubleshooting, etc.
Upvotes: 0
Views: 1353
Reputation: 1739
You can create a new column on the DataFrame using withColumn and input_file_name(), and then use the collect() operation, something like below:
df = spark.read.csv("/FileStore/tmp/part-00000-tid-6847462229548084439-4a50d1c2-9b65-4756-9a29-0044d620a1da-11-1-c000.csv")
df.show()
+-----+
| _c0|
+-----+
|43368|
+-----+
from pyspark.sql.functions import input_file_name
df1 = df.withColumn("file_name", input_file_name())
df1.show(truncate=False)
+-----+---------------------------------------------------------------------------------------------------------+
|_c0  |file_name                                                                                                |
+-----+---------------------------------------------------------------------------------------------------------+
|43368|dbfs:/FileStore/tmp/part-00000-tid-6847462229548084439-4a50d1c2-9b65-4756-9a29-0044d620a1da-11-1-c000.csv|
+-----+---------------------------------------------------------------------------------------------------------+
Now create a variable holding the file name by calling collect() and then splitting the path on /:
file_name = df1.collect()[0][1].split("/")[3]
print(file_name)
Output
part-00000-tid-6847462229548084439-4a50d1c2-9b65-4756-9a29-0044d620a1da-11-1-c000.csv
Please note that in your case the indexes used after collect() and after split() may differ, since they depend on the column position and on how deep the file path is.
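If the fixed index after the split is a concern, one option is to take whatever follows the last / instead. The snippet below is only a sketch in plain Python: file_name_from_path is a hypothetical helper name, and the path is hard-coded from the show() output above rather than read from Spark. In a real job the path would presumably come from something like df1.select("file_name").first()[0], which also avoids relying on the column position.

```python
def file_name_from_path(path: str) -> str:
    """Return the part of a path or URI that follows the last "/"."""
    # rsplit with maxsplit=1 cuts only at the final "/", so the result
    # does not depend on how many directories precede the file name.
    return path.rsplit("/", 1)[-1]


# Assumption: in the Spark job this value would come from the collected row,
# for example: full_path = df1.select("file_name").first()[0]
# Here it is hard-coded from the df1.show() output for illustration.
full_path = (
    "dbfs:/FileStore/tmp/"
    "part-00000-tid-6847462229548084439-4a50d1c2-9b65-4756-9a29-0044d620a1da-11-1-c000.csv"
)

file_name = file_name_from_path(full_path)
print(file_name)
```

The same helper should work unchanged for paths with a different depth or scheme, so the variable can be passed straight to your logging calls.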
Upvotes: 1