Salah K.

Reputation: 143

Add the creation date of a parquet file into a DataFrame

Currently I load multiple parquet files with this code:

df = spark.read.parquet("/mnt/dev/bronze/Voucher/*/*")

(Inside the Voucher folder there is one folder per date, with one parquet file inside each.)

How can I add the creation date of each parquet file to my DataFrame?

Thanks

EDIT 1:

Thanks rainingdistros, I wrote this:

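import os
from datetime import datetime

# Folder for one date; XXXXXX.parquet stands in for the actual file name
Path = "/dbfs/mnt/dev/bronze/Voucher/2022-09-23"
fileFull = os.path.join(Path, "XXXXXX.parquet")
statinfo = os.stat(fileFull)
# On Unix, st_ctime is the time of the last metadata change
create_date = datetime.fromtimestamp(statinfo.st_ctime)
display(create_date)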

Now I must find a way to loop through all the files and add a column to the DataFrame.

Upvotes: 1

Views: 2224

Answers (2)

Saideep Arikontham

Reputation: 6114

  • The information returned by os.stat might not be accurate unless adding the extra column (with the creation time) is the first operation performed on these files.

  • Each time the file is modified, both st_mtime and st_ctime are updated to the modification time.

  • When I modify this file, the changes can be observed in the information returned by os.stat.


  • So, if adding this column is the first operation performed on these files, you can use the following code to add the date as a column to your files:

import os
from datetime import datetime

import pandas as pd

path = "/dbfs/mnt/repro/2022-12-01"
for file in os.listdir(path):
    file_path = f"{path}/{file}"
    pdf = pd.read_csv(file_path)
    display(pdf)
    # Stat the file being processed, not a fixed sample file
    statinfo = os.stat(file_path)
    create_date = datetime.fromtimestamp(statinfo.st_ctime)
    # A scalar value is broadcast to every row
    pdf['creation_date'] = create_date.date()
    # Rewriting the file updates st_mtime and st_ctime, so run this only once
    pdf.to_csv(file_path, index=False)


  • After running the code, each file contains the new creation_date column.

  • In this case it might be better to take the date directly from the folder name: the information is already available, so you only need to extract it and add it as a column in the same way as in the code above.
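The folder-name approach can be sketched as follows. This is only an illustration and assumes every file sits in a folder named YYYY-MM-DD, as in the question; the helper name folder_date and the regular expression are my own, not part of any library.

```python
import re
from datetime import date, datetime
from typing import Optional

# Assumption: the parent folder of each file is named YYYY-MM-DD.
DATE_FOLDER = re.compile(r"/(\d{4}-\d{2}-\d{2})/[^/]+$")


def folder_date(file_path: str) -> Optional[date]:
    """Return the date encoded in the parent folder name, or None if absent."""
    match = DATE_FOLDER.search(file_path)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(1), "%Y-%m-%d").date()
    except ValueError:
        # The folder looks like a date but is not a valid one (e.g. 2022-13-45)
        return None


print(folder_date("/mnt/dev/bronze/Voucher/2022-09-23/XXXXXX.parquet"))
```

On the Spark side, the same idea should work without touching the files at all, by combining input_file_name(), regexp_extract() and to_date() from pyspark.sql.functions in a withColumn call on the DataFrame read from the question's path. I have not tested that against this folder layout.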

Upvotes: 1

rainingdistros

Reputation: 645

See if the steps below help:

  1. Get the list of files in DBFS; see the Stack Overflow question "Loop through Files in DBFS".

  2. Once you have the files, loop through them and, for each file, use the code you have written in your question.

Please note that dbutils exposes the mtime of a file. The os module lets you read the ctime, which on Unix is the time of the most recent metadata change. Ideally you would use st_birthtime, but that did not seem to work in my trials. Hope it works for you.
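A minimal sketch of the two steps, using only the os module through the /dbfs mount. The function name file_timestamps and the ".parquet" suffix filter are my own assumptions. st_birthtime is only present on some platforms, so the code falls back to st_ctime when it is missing.

```python
import os
from datetime import datetime
from typing import Dict


def file_timestamps(root: str, suffix: str = ".parquet") -> Dict[str, datetime]:
    """Map each matching file under root to its best available timestamp."""
    result: Dict[str, datetime] = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(suffix):
                continue
            full_path = os.path.join(dirpath, name)
            info = os.stat(full_path)
            # st_birthtime is not available everywhere; on Unix st_ctime is
            # the last metadata change, not a true creation time.
            timestamp = getattr(info, "st_birthtime", info.st_ctime)
            result[full_path] = datetime.fromtimestamp(timestamp)
    return result
```

Calling file_timestamps("/dbfs/mnt/dev/bronze/Voucher") would return one entry per parquet file. Those entries could then be turned into a small DataFrame and joined to the main one on the file path.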

Upvotes: 0
