user12852650


PySpark to iterate through year, month and date folders and subfolders to get the latest file

Could someone please help with some PySpark code to loop over folders and subfolders and get the latest file?

The folder and subfolder structure is shown below. I want to descend into the latest year folder, then the latest month folder, then the latest date folder to get the file.

Raw/2019
Raw/2020/06/21
Raw/2021/03/18/file.csv
Raw/2021/04/13/file.csv
Raw/2021/04/14/file.csv

Upvotes: 1

Views: 1366

Answers (2)

Vaebhav

Reputation: 5032

The best way to get the latest directory would be the File System API that pltc mentioned.

To add to that, here is a small utility function that walks the input path recursively using the File System API and sorts the resulting directories by getctime:

import os

def latest_dir(inp):
    # inp is a DBFS path such as "dbfs:/Raw"
    def recur_directory(path, res):
        for dr in dbutils.fs.ls(path):
            # dr.path is "dbfs:/..."; dropping the "dbfs:" scheme and
            # prepending "/dbfs" yields the local FUSE mount path.
            if os.path.isdir("/dbfs" + dr.path[5:]):
                recur_directory(dr.path, res)
            else:
                # A file was found: record its parent directory once.
                res.append(path)
                break
        return res

    dir_lst = recur_directory(inp, [])

    # Newest first, by creation time of the local mount path.
    return sorted(["/dbfs" + x[5:] for x in dir_lst],
                  key=os.path.getctime, reverse=True)
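
For example (the path and call here are hypothetical, matching the question's layout), the newest leaf directory and the files inside it would come from:

leaf_dirs = latest_dir("dbfs:/Raw")
newest = leaf_dirs[0]                       # e.g. "/dbfs/Raw/2021/04/14"
print(dbutils.fs.ls("dbfs:" + newest[5:]))  # files in the newest folder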

Upvotes: 0

pltc

Reputation: 6082

You wouldn't need Spark to do that. Since you're using Azure Databricks, you should use the Databricks File System API instead, with something like this:

lst = dbutils.fs.ls("dbfs:/Raw/")
print(sorted(lst, key=lambda f: f.name, reverse=True)[0])
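
Note that this returns only the newest year folder. A minimal sketch that repeats the same idea per level (year, then month, then day) to reach the actual file, assuming the zero-padded folder names sort lexically in chronological order, could look like this:

def latest_entry(path):
    # Newest entry at one level; zero-padded names ("04" < "12")
    # make lexical order match chronological order.
    return sorted(dbutils.fs.ls(path), key=lambda f: f.name, reverse=True)[0]

path = "dbfs:/Raw/"
for _ in range(3):                # year -> month -> day
    path = latest_entry(path).path

latest_file = latest_entry(path)  # the file inside the newest day folder
print(latest_file.path)

This assumes the newest branch always goes three levels deep and contains at least one file, as in the question's example.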

Upvotes: 1
