Reputation: 1873
I have a folder where files get dropped (daily, weekly) and I need to add the year and week/day, which are in the file name in a consistent format, as variables to my data frame. The prefix can change (e.g., sales_report
, cash_flow
, etc.) but the last characters are always YYYY_WW.csv
.
For instance, for a weekly file I could manually do it for each file as:
from pyspark.sql.functions import lit
df = spark.read.load('my_folder/sales_report_2019_12.csv', format="csv").withColumn("sales_year", lit(2019)).withColumn("sales_week", lit(12))
I would like to do the equivalent of using a substring function counting from the right of the file name to parse the 12
and 2019
. Were I able to parse the file name for these variables I could then read in all of the files in the folder using a wildcard such as df = spark.read.load('my_folder/sales_report_*.csv', format="csv")
which would greatly simplify my code.
Upvotes: 0
Views: 1288
Reputation: 32670
You can easily extract it from the filename using the input_file_name()
column and some string functions like regexp_extract
and substring_index
:
df = spark.read.load('my_folder/*.csv', format="csv")
df = df.withColumn("year_week", regexp_extract(input_file_name(), "\d{4}_\d{1,2}"))\
.withColumn("sales_year", substring_index(col("year_week"), "_", 1))\
.withColumn("sales_week", substring_index(col("year_week"), "_", -1))\
.drop("year_week")
Upvotes: 2
Reputation: 76
you can try the below :
import glob
listfiles = glob.glob('my_folder/sales_report_*.csv')
for files in listfiles:
weekyear = c.split('_',2)[-1].split('_')
week = weekyear[1]
year = weekyear[0]
df = spark.read.load('files', format="csv").withColumn("sales_year", lit(year)).withColumn("sales_week", lit(week))
Upvotes: 1