Vitali Dedkov

Reputation: 318

Inserting a date variable into a DataFrame with a string file path (read.csv)

I am working through a process where I want to ingest a CSV file into a DataFrame. The file is a delta file that is produced daily and stored in Azure Data Lake Store.

DF = (
    spark
    .read
    .option("header", True)
    .option("inferSchema", "true")
    .option("delimiter", "|")
    .csv("folder2/folder1/Intenstion_file2020*.csv")
)

The above code collects all the files whose names match the "file2020" pattern, so if there are 10 of them they all get put into one DataFrame.

What I want to do, though, is select only the file that matches the system date instead of ingesting all 10 files into the DataFrame. So if I have the following files:

1) file2020/01/01
2) file2020/01/02
3) file2020/01/09

I want only the third file to be ingested. Then the next run would select the file with the most current date.

I tried solving this by first getting a system date. This runs before the dataframe portion.

# Get the system timestamp
import datetime
date_value = datetime.datetime.now()
print(datetime.datetime.strftime(date_value, '%Y/%m/%d'))

So if I run the notebook cell above, the formatted date_value is "2020/01/09". What I want to do is concatenate that value into the path passed to csv() in the DataFrame example above.

So instead of having

.csv("folder2/folder1/Intenstion_file2020*.csv")

I would have something like:

.csv(concat_ws("....file" date_value "*.csv"))

So it would automatically find the file with the date closest to the system date.
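For illustration, here is a minimal sketch of what I imagine this could look like, assuming the placeholder folder names from above and a spark session available as in the notebook (the real file names may use a different date format, so the format string might need adjusting):

import datetime

# Format today's date the same way the date appears in the file names
# (adjust the format string, e.g. to '%Y%m%d', if the real names differ).
date_str = datetime.datetime.now().strftime("%Y/%m/%d")

# The argument to .csv() is a plain Python string, so ordinary string
# formatting can build the path before Spark ever sees it.
path = f"folder2/folder1/Intenstion_file{date_str}*.csv"

DF = (
    spark
    .read
    .option("header", True)
    .option("inferSchema", "true")
    .option("delimiter", "|")
    .csv(path)
)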

I tried some variations of the above, but I am missing the proper syntax, or maybe what I am doing is not possible at all. Has anyone tried to do this?

Any help is appreciated.

Update 01/09/2020: I updated the question to make it clearer what I am trying to achieve.

Upvotes: 0

Views: 474

Answers (1)

pkatta

Reputation: 9

I guess the way you are using concat_ws is wrong.

Please refer to this: https://spark.apache.org/docs/2.1.0/api/python/pyspark.sql.html#pyspark.sql.functions.concat_ws

Moreover, you cannot concatenate a column and a plain string; both arguments need to be columns.

So use f.concat_ws("-", df.colA, f.lit("date_value"))
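For reference, a minimal runnable sketch of that call (the column name colA and the sample values are made up for illustration):

from pyspark.sql import SparkSession
import pyspark.sql.functions as f

spark = SparkSession.builder.getOrCreate()

# Toy DataFrame; colA is a hypothetical column for illustration.
df = spark.createDataFrame([("file2020",)], ["colA"])

# lit() wraps the plain string in a Column, so concat_ws can join the
# column value and the literal with the given separator.
df.select(f.concat_ws("-", df.colA, f.lit("01/09")).alias("joined")).show()
# Prints a single row: file2020-01/09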

Upvotes: 0
