Satyaranjan Behera

Reputation: 117

How to get the last modification time of each file present in Azure Data Lake Storage using Python in a Databricks workspace?

I am trying to get the last modification time of each file present in Azure Data Lake.

files = dbutils.fs.ls('/mnt/blob')

for fi in files: print(fi)

Output: FileInfo(path='dbfs:/mnt/blob/rule_sheet_recon.xlsx', name='rule_sheet_recon.xlsx', size=10843)

Here I am unable to get the last modification time of the files. Is there any way to get that property?

I tried the shell command below to see the properties, but I am unable to store its output in a Python object.

%sh ls -ls /dbfs/mnt/blob/

Output: total 0

0 -rw-r--r-- 1 root root 13577 Sep 20 10:50 a.txt

0 -rw-r--r-- 1 root root 10843 Sep 20 10:50 b.txt
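Since DBFS is also exposed through the /dbfs FUSE mount (which is what the %sh listing above reads), plain Python file APIs can capture these attributes in Python objects. A minimal sketch, assuming the mount path from the question:

import os
import datetime

# /dbfs is the local FUSE view of dbfs:/, so standard file stats work here
for entry in os.scandir('/dbfs/mnt/blob'):
    mtime = datetime.datetime.fromtimestamp(entry.stat().st_mtime)
    print(entry.name, mtime)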

Upvotes: 9

Views: 20615

Answers (3)

stephen meckstroth

Reputation: 345

You can avoid the need for account keys and the Azure Storage SDK altogether if your Databricks cluster is already configured to access the storage account.

import time

# Hadoop Path class via the JVM gateway; the cluster's own Hadoop
# configuration supplies the storage credentials
Path = spark._jvm.org.apache.hadoop.fs.Path
sc = spark.sparkContext
fs = Path('abfss://container@account.dfs.core.windows.net/').getFileSystem(sc._jsc.hadoopConfiguration())

res = fs.listFiles(Path('abfss://container@account.dfs.core.windows.net/your/storage/path'), True)

while res.hasNext():
  file = res.next()
  # getModificationTime() returns epoch milliseconds
  localTime = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(file.getModificationTime() / 1000))
  print(f"{file.getPath()}: {localTime}")

Upvotes: 4

Naveen Anto

Reputation: 97

We can use the os package to get the information, since DBFS is also exposed as a local filesystem under /dbfs. For example:

import os

def get_filemtime(filename):
  # last-modification time as seconds since the epoch
  return os.path.getmtime(filename)

You can pass the absolute path of the file through the /dbfs FUSE mount, e.g. /dbfs/mnt/adls/logs/ehub/app/0/2021/07/21/15/05/40.avro (os.path.getmtime does not understand the dbfs:/ URI scheme).
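A hypothetical usage, converting the returned epoch seconds to a readable timestamp (the avro path is illustrative):

from datetime import datetime

mtime = get_filemtime('/dbfs/mnt/adls/logs/ehub/app/0/2021/07/21/15/05/40.avro')
print(datetime.fromtimestamp(mtime).strftime('%Y-%m-%d %H:%M:%S'))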

Upvotes: 1

We don't have a direct method to get those details, but we can get them with the following simple Python code.

Example: Suppose you want to get all the subdirectories and files in the ADLS path container_name/container-Second. You can use the code below.

from azure.storage.blob import BlockBlobService  # legacy azure-storage-blob v2 SDK
from datetime import datetime

block_blob_service = BlockBlobService(account_name='account-name', account_key='account-key')
container_name = 'container-firstname'
second_container_name = 'container-Second'
#block_blob_service.create_container(container_name)
generator = block_blob_service.list_blobs(container_name, prefix="Recovery/")
report_time = datetime.now().strftime('%Y-%m-%d %H:%M:%S')

for blob in generator:
    # fetch the blob's properties once and reuse them
    props = block_blob_service.get_blob_properties(container_name, blob.name).properties
    file_size = props.content_length
    last_modified = props.last_modified
    line = '|'.join([container_name, second_container_name, blob.name,
                     str(file_size), str(last_modified), report_time])
    print(line)
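
BlockBlobService belongs to the legacy azure-storage-blob v2 SDK; current releases (v12+) replace it with BlobServiceClient. A rough equivalent with the newer SDK, assuming the same account name and key (a sketch, not a drop-in replacement):

from azure.storage.blob import BlobServiceClient

service = BlobServiceClient(
    account_url='https://account-name.blob.core.windows.net',
    credential='account-key')
container = service.get_container_client('container-firstname')

for blob in container.list_blobs(name_starts_with='Recovery/'):
    # list_blobs already returns size and last_modified; no per-blob call needed
    print(f"{blob.name}|{blob.size}|{blob.last_modified}")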


Upvotes: 1
