Reputation: 363
I am using PySpark. I added a column to capture the filename with its full path:
from pyspark.sql.functions import input_file_name
data = data.withColumn("sourcefile",input_file_name())
I want to retrieve only the filename with its parent folder from this column. Please help.
Example:
Inputfilename = "adl://dotdot.com/ingest/marketing/abc.json"
What output I am looking for is:
marketing/abc.json
Note: I can do the string operation; the filepath column is part of the dataframe.
Upvotes: 9
Views: 18769
Reputation: 1240
The input_file_name() function wasn't working reliably for me with Spark 3.3.2, delta-spark 2.3.0, and Delta parquet files; the following works better:
from pyspark.sql.functions import regexp_extract

regex_str = r"[\/]([^\/]+[\/][^\/]+)$"
data = data.withColumn("sourcefile", regexp_extract("_metadata.file_path", regex_str, 1))
Also, if you are using Databricks, input_file_name is deprecated according to https://docs.databricks.com/en/sql/language-manual/functions/input_file_name.html:
In Databricks SQL and Databricks Runtime 13.1 and above this function is deprecated. Please use _metadata.file_name.
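As an offline sanity check (no Spark needed), the regex above can be verified with Python's re module against the example path from the question:

```python
import re

# Same pattern as above: capture the last two path components
# (parent folder + filename). Written as a raw string so the
# backslashes survive Python's string parsing.
regex_str = r"[\/]([^\/]+[\/][^\/]+)$"

path = "adl://dotdot.com/ingest/marketing/abc.json"
match = re.search(regex_str, path)
print(match.group(1))  # marketing/abc.json
```

Spark's regexp_extract uses Java regex rather than Python's, but this simple pattern behaves identically in both.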
Upvotes: 0
Reputation: 286
If you want to keep the value in a dataframe column, you can use regexp_extract from pyspark.sql.functions. Apply it to the column holding the path, passing a regular expression that extracts the desired part:
from pyspark.sql.functions import input_file_name, regexp_extract

data = data.withColumn("sourcefile", input_file_name())
regex_str = r"[\/]([^\/]+[\/][^\/]+)$"
data = data.withColumn("sourcefile", regexp_extract("sourcefile", regex_str, 1))
Upvotes: 11
Reputation: 15258
I think that what you are looking for is:
sc.wholeTextFiles('path/to/files').map(
    lambda x: ('/'.join(x[0].split('/')[-2:]), x[1])
)
This creates an RDD of 2-tuples: the first element is the path to the file, the second is the content of the file. That is the only way to link a path to its content in Spark. Other methods exist in Hive, for example.
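The path manipulation inside the lambda can be checked in plain Python against the example path from the question:

```python
# Keep only the last two path components (parent folder + filename),
# mirroring the join/split logic used in the map() lambda above.
path = "adl://dotdot.com/ingest/marketing/abc.json"
short = '/'.join(path.split('/')[-2:])
print(short)  # marketing/abc.json
```

Note that this approach reads full file contents via wholeTextFiles, which is heavier than the regexp_extract answers if you only need the path column.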
Upvotes: 0