Reputation: 143
Currently I load multiple parquet files with this code:
df = spark.read.parquet("/mnt/dev/bronze/Voucher/*/*")
(Inside the Voucher folder there is one folder per date, and each of those contains one parquet file.)
How can I add the creation date of each parquet file to my DataFrame?
Thanks
EDIT 1:
Thanks rainingdistros, I wrote this:
import os
from datetime import datetime
Path = "/dbfs/mnt/dev/bronze/Voucher/2022-09-23"
# os.path.join avoids the doubled "/" between folder and file name
fileFull = os.path.join(Path, 'XXXXXX.parquet')
statinfo = os.stat(fileFull)
# st_ctime is the last metadata change on Unix, not necessarily the creation time
create_date = datetime.fromtimestamp(statinfo.st_ctime)
display(create_date)
Now I must find a way to loop through all the files and add a column to the DataFrame.
Upvotes: 1
Views: 2224
Reputation: 6114
The information returned from os.stat might not be accurate unless your requirement (i.e., adding the additional column with the creation time) is the first operation performed on these files.
Each time the file is modified, both st_mtime and st_ctime are updated to this modification time.
[Images: os.stat output for the file]
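The behaviour described above can be reproduced with a small local check. This is only a sketch: the helper names stat_times and times_before_and_after_append are made up for illustration, it writes to a temporary file rather than to DBFS, and it assumes a Unix-like filesystem, where st_ctime tracks the last metadata change.

```python
import os
import tempfile
import time
from datetime import datetime


def stat_times(path):
    """Return (st_mtime, st_ctime) of path as datetime objects."""
    info = os.stat(path)
    return (datetime.fromtimestamp(info.st_mtime),
            datetime.fromtimestamp(info.st_ctime))


def times_before_and_after_append(path):
    """Append one line to the file and return its times before and after."""
    before = stat_times(path)
    time.sleep(0.05)  # short pause so a timestamp change is visible
    with open(path, "a") as handle:
        handle.write("extra line\n")
    after = stat_times(path)
    return before, after


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical sample file, created only for this check
        sample = os.path.join(tmp, "sample1.csv")
        with open(sample, "w") as handle:
            handle.write("id,value\n1,a\n")
        before, after = times_before_and_after_append(sample)
        print("before (mtime, ctime):", before)
        print("after  (mtime, ctime):", after)
```

If both values move forward after the append, st_ctime cannot be relied on as a creation date once a file has been rewritten.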
import os
from datetime import datetime
import pandas as pd
path = "/dbfs/mnt/repro/2022-12-01"
for file in os.listdir(path):
    full_path = f"{path}/{file}"
    pdf = pd.read_csv(full_path)
    # Read the timestamp of the current file before it is rewritten below
    statinfo = os.stat(full_path)
    create_date = datetime.fromtimestamp(statinfo.st_ctime)
    # The same date is assigned to every row of this file
    pdf['creation_date'] = create_date.date()
    display(pdf)
    pdf.to_csv(full_path, index=False)
Upvotes: 1
Reputation: 645
See if the steps below help.
Refer to this link to get the list of files in DBFS: SO - Loop through Files in DBFS
Once you have the files, loop through them and for each file use the code you have written in your question.
Please note that dbutils exposes the mtime of a file. The os module provides a way to get the ctime, i.e. the time of the most recent metadata change on Unix. Ideally this should have been st_birthtime, but that did not seem to work in my trials. Hope it works for you.
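The steps above could be sketched roughly as follows. This is a sketch under stated assumptions, not a tested Databricks solution: list_file_times and to_spark_path are hypothetical helper names, the files are assumed to be reachable through the local /dbfs mount, and the "dbfs:/" path form is an assumption about how Spark reports file paths that should be checked against your own data.

```python
import os
from datetime import datetime


def list_file_times(root, suffix=".parquet"):
    """Walk root and return one dict per matching file with its st_ctime."""
    rows = []
    for folder, _subdirs, files in os.walk(root):
        for name in sorted(files):
            if not name.endswith(suffix):
                continue
            local_path = os.path.join(folder, name)
            # st_ctime: last metadata change on Unix, not a true creation time
            changed = datetime.fromtimestamp(os.stat(local_path).st_ctime)
            rows.append({"local_path": local_path, "ctime": changed})
    return rows


def to_spark_path(local_path):
    """Hypothetical helper: map '/dbfs/mnt/...' to 'dbfs:/mnt/...'."""
    prefix = "/dbfs/"
    if local_path.startswith(prefix):
        return "dbfs:/" + local_path[len(prefix):]
    return local_path
```

On Databricks, the resulting rows could then be turned into a small DataFrame with spark.createDataFrame and joined to the main DataFrame on a column filled by input_file_name() from pyspark.sql.functions, provided the two path forms really match.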
Upvotes: 0