Reputation: 318
I am working through a process where I want to ingest a CSV file into a dataframe. The file is a daily delta file stored in Azure Data Lake Store.
DF = (
    spark.read
    .option("header", True)
    .option("inferSchema", "true")
    .option("delimiter", "|")
    .csv("folder2/folder1/Intenstion_file2020*.csv")
)
From the above code I collect all the files whose names start with "file2020". So if there are 10 such files, they all get put into one dataframe.
What I want to do, though, is instead of ingesting all 10 files into a dataframe, select only the file that matches the system date. So if I have the following files: 1) file2020/01/01 2) file2020/01/02 3) file2020/01/09, I want only the third file to be ingested. Then the next run would select the file with the most current date.
I tried solving this by first getting a system date. This runs before the dataframe portion.
# Getting system time stamp
import datetime
date_value = datetime.datetime.now().strftime('%Y/%m/%d')
print(date_value)
So if I run the above in the notebook I would have date_value = "2020/01/09". What I want to do is then concatenate that value into the csv(path) in the dataframe example above.
So instead of having
.csv("folder2/folder1/Intenstion_file2020*.csv")
I would have something like:
.csv(concat_ws("....file" date_value "*.csv"))
So it would automatically find the file with the date that is closest to system date.
I tried some variations of the above, but I am missing the proper syntax, or perhaps what I am doing above is not possible. Has anyone tried to do this?
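To be clearer about what I am after, here is a minimal sketch of the idea using plain Python string formatting to build the path (untested; it assumes the date appears in the file name in the same %Y/%m/%d form, and keeps my folder layout from above):

```python
import datetime

# Format today's date the same way it appears in the file names,
# e.g. "2020/01/09" (assumption about the naming scheme)
date_value = datetime.datetime.now().strftime('%Y/%m/%d')

# The path argument to .csv() is an ordinary Python string,
# so it can be built with format() rather than concat_ws
path = "folder2/folder1/Intenstion_file{}*.csv".format(date_value)
print(path)

# The resulting string would then be passed straight to the reader:
# DF = spark.read.option("header", True).csv(path)
```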
Any help is appreciated.
Update 01/09/2020 I updated the question to make it clearer as to what I am trying to achieve.
Upvotes: 0
Views: 474
Reputation: 9
I think the way you are using concat_ws is wrong.
Please refer to this: https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.concat_ws
Moreover, you cannot concatenate a column and a plain string directly; the arguments should be columns, so a literal string must be wrapped with lit.
So use f.concat_ws("-", df.colA, f.lit("date_value"))
Upvotes: 0