batman_special

Reputation: 125

script to get the file last modified date and file name pyspark

I have a mount point that points to a blob storage location holding multiple files. We need to find the last modified date of each file along with its file name. The files are listed below, followed by the script I am using:

/mnt/schema_id=na/184000-9.jsonl
/mnt/schema_id=na/185000-0.jsonl
/mnt/schema_id=na/185000-22.jsonl
/mnt/schema_id=na/185000-25.jsonl

import os
import time
# Path to the file/directory
path = "/mnt/schema_id=na"

ti_c = os.path.getctime(path)
ti_m = os.path.getmtime(path)

c_ti = time.ctime(ti_c)
m_ti = time.ctime(ti_m)

print(f"The file located at the path {path} was created at {c_ti} and was last modified at {m_ti}")

Upvotes: 4

Views: 13159

Answers (3)

ARCrow

Reputation: 1857

If you're working in Databricks: since Databricks Runtime 10.4 (released on Mar 18, 2022), the dbutils.fs.ls() command also returns the "modificationTime" of folders and files.
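
As a rough sketch of how that field could be used: the snippet below assumes modificationTime is expressed in milliseconds since the Unix epoch and converts it to a readable UTC timestamp. Because dbutils only exists inside a Databricks notebook, a namedtuple called FileInfo stands in for the returned objects here, and the size and timestamp in the sample entry are made up for illustration.

```python
from collections import namedtuple
from datetime import datetime, timezone

# Stand-in for the objects returned by dbutils.fs.ls(); field names are assumed to match.
FileInfo = namedtuple("FileInfo", ["path", "name", "size", "modificationTime"])


def last_modified(entries):
    """Return (name, UTC datetime) pairs, assuming modificationTime is epoch milliseconds."""
    return [
        (e.name, datetime.fromtimestamp(e.modificationTime / 1000, tz=timezone.utc))
        for e in entries
    ]


# In a Databricks notebook this would be: entries = dbutils.fs.ls("/mnt/schema_id=na")
# The entry below is illustrative sample data, not real output.
entries = [
    FileInfo("dbfs:/mnt/schema_id=na/184000-9.jsonl", "184000-9.jsonl", 1024, 1647561600000),
]

for name, modified in last_modified(entries):
    print(name, modified.isoformat())
```

Returning timezone-aware datetimes avoids depending on the cluster's local time zone.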

Upvotes: 2

ewokx

Reputation: 2425

Here's one way you can achieve it:

import os
import time

# Path to the directory; the /dbfs prefix exposes DBFS to local file APIs
path = "/dbfs/mnt/schema_id=na"

for file_item in os.listdir(path):
    file_path = os.path.join(path, file_item)
    # Note: on Linux, getctime() is the last metadata change, not the creation time
    ti_c = os.path.getctime(file_path)
    ti_m = os.path.getmtime(file_path)

    # Convert epoch seconds to readable strings
    c_ti = time.ctime(ti_c)
    m_ti = time.ctime(ti_m)

    print(f"The file {file_item} located at the path {path} was created at {c_ti} and was last modified at {m_ti}")
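
If you'd rather collect the results than print them, here is a minimal sketch of the same loop using os.scandir. The function name list_modified_times is hypothetical, and it assumes you only care about regular files directly inside the directory.

```python
import os
from datetime import datetime, timezone


def list_modified_times(directory):
    """Return (file name, last-modified UTC datetime) pairs, newest first."""
    rows = []
    with os.scandir(directory) as entries:
        for entry in entries:
            if entry.is_file():
                mtime = entry.stat().st_mtime  # seconds since the epoch
                rows.append((entry.name, datetime.fromtimestamp(mtime, tz=timezone.utc)))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# On Databricks the call would look like:
#     list_modified_times("/dbfs/mnt/schema_id=na")
```

The returned list can be turned into a DataFrame or filtered further, which is harder to do with printed strings.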

Upvotes: 1

Alex Ott

Reputation: 87174

If you're using operating system-level commands to get file information, you can't access that location by its DBFS path: on Databricks it lives on the Databricks File System (DBFS), not on the local disk.

To reach it from Python's os functions, you need to prepend /dbfs to the path, so it will be:

...
path = "/dbfs/mnt/schema_id=na"
for file_item in os.listdir(path):
    file_path = os.path.join(path, file_item)
    ti_c = os.path.getctime(file_path)
    dbfs_path = file_path[5:]
    ...

Note the [5:]: it strips the /dbfs prefix from the path to make it compatible with DBFS APIs. The os.path calls themselves still need the /dbfs-prefixed path.
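
The prefix handling can be wrapped in two small helpers, sketched below. The names to_dbfs_path and to_local_path are hypothetical, and the sketch assumes the local mount prefix is exactly /dbfs and that DBFS URIs may carry a dbfs: scheme.

```python
LOCAL_PREFIX = "/dbfs"


def to_dbfs_path(local_path):
    """Strip the /dbfs prefix: '/dbfs/mnt/x' -> '/mnt/x'."""
    if local_path == LOCAL_PREFIX or local_path.startswith(LOCAL_PREFIX + "/"):
        return local_path[len(LOCAL_PREFIX):] or "/"
    return local_path  # already a DBFS-style path


def to_local_path(dbfs_path):
    """Add the /dbfs prefix so os and os.path functions can reach the file."""
    if dbfs_path.startswith("dbfs:"):
        dbfs_path = dbfs_path[len("dbfs:"):]
    return LOCAL_PREFIX + "/" + dbfs_path.lstrip("/")


print(to_dbfs_path("/dbfs/mnt/schema_id=na/184000-9.jsonl"))
print(to_local_path("/mnt/schema_id=na/184000-9.jsonl"))
```

Checking for the prefix explicitly, rather than slicing a fixed number of characters, avoids mangling a path that was never prefixed in the first place.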

Upvotes: 4
