Chris

Reputation: 13

PySpark input_file_name() into a variable NOT df

I want to store the value from input_file_name() into a variable instead of a dataframe. This variable will then be used for logging, troubleshooting, etc.

Upvotes: 0

Views: 1353

Answers (1)

Dipanjan Mallick

Reputation: 1739

You can create a new column on the DataFrame using withColumn and input_file_name(), and then use the collect() operation, something like below:

df = spark.read.csv("/FileStore/tmp/part-00000-tid-6847462229548084439-4a50d1c2-9b65-4756-9a29-0044d620a1da-11-1-c000.csv")
df.show()

+-----+
|  _c0|
+-----+
|43368|
+-----+

from pyspark.sql.functions import input_file_name

df1 = df.withColumn("file_name", input_file_name())
df1.show(truncate=False)

+-----+---------------------------------------------------------------------------------------------------------+
|_c0  |file_name                                                                                                |
+-----+---------------------------------------------------------------------------------------------------------+
|43368|dbfs:/FileStore/tmp/part-00000-tid-6847462229548084439-4a50d1c2-9b65-4756-9a29-0044d620a1da-11-1-c000.csv|
+-----+---------------------------------------------------------------------------------------------------------+

Now, create a variable holding the file name by using collect() and then splitting the path on /:

file_name = df1.collect()[0][1].split("/")[3]

print(file_name)

Output

part-00000-tid-6847462229548084439-4a50d1c2-9b65-4756-9a29-0044d620a1da-11-1-c000.csv

Please note, in your case the index used for collect() as well as the index after the split might differ.
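
If you only need the final file name regardless of how deep the path is, a minimal sketch (assuming the same df1 and the file_name column from above) avoids hard-coding the split index by taking the last path segment, or by using os.path.basename from the standard library:

import os

# Grab the file_name value from the first row only.
full_path = df1.select("file_name").first()[0]

# [-1] always picks the last path segment, so the directory depth doesn't matter.
file_name = full_path.split("/")[-1]

# Equivalent, using the standard library:
file_name = os.path.basename(full_path)

print(file_name)
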

Upvotes: 1
