Antony

Reputation: 1118

Connect to HDFS location from Azure databricks notebook to get the file number and size

I am trying to connect to an HDFS location from a Databricks notebook to get file details.

Below is the code I tried:

%fs ls dbfs:/mnt/<mount>/dev/data/audit/

The result showed a size of 0, and listed only the audit folder itself, not any of its subfolders.

The audit folder contains 5 subfolders with files inside them. I want to get the number of files in each subfolder and the total size of those 5 subfolders.

I tried dbutils in Scala, but it doesn't have any function to get the number of files in a folder or the size of a file.

  1. Is there any way to get the size of folders and subfolders in HDFS from a Databricks notebook?
  2. Is there any way to get the number of files in folders and subfolders in HDFS from a Databricks notebook?


Upvotes: 0

Views: 1899

Answers (1)

Bunyod

Reputation: 1371

There is no built-in method in dbutils that returns the size of a directory or the number of files in it. However, you can compute both by iterating over the directories recursively.

1. Number of files, computed recursively:

import scala.annotation.tailrec
import com.databricks.backend.daemon.dbutils.FileInfo
import com.databricks.dbutils_v1

private lazy val dbutils = dbutils_v1.DBUtilsHolder.dbutils

def numberOfFiles(location: String): Int = {
  @tailrec
  def go(items: List[FileInfo], result: Int): Int = items match {
    case head :: tail =>
      val entries = dbutils.fs.ls(head.path)
      val directories = entries.filter(_.isDir)
      // Count only plain files; queue the subdirectories for later traversal
      go(tail ++ directories, result + (entries.size - directories.size))
    case _ => result
  }

  go(dbutils.fs.ls(location).toList, 0)
}

2. Total size of a folder (this reuses the `dbutils` value defined above):

import scala.annotation.tailrec
import com.databricks.backend.daemon.dbutils.FileInfo

def sizeOfDirectory(location: String): Long = {
  @tailrec
  def go(items: List[FileInfo], result: Long): Long = items match {
    case head :: tail =>
      val entries = dbutils.fs.ls(head.path)
      val directories = entries.filter(_.isDir)
      // Directory entries are listed with size 0, so summing every entry is safe
      val updated = entries.map(_.size).foldLeft(result)(_ + _)
      go(tail ++ directories, updated)
    case _ => result
  }

  go(dbutils.fs.ls(location).toList, 0L)
}
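Outside Databricks you can sanity-check the same breadth-first, tail-recursive traversal against the local filesystem; this sketch swaps `dbutils.fs.ls` for `java.io.File.listFiles` (the `countFilesLocal` and `sizeOfDirectoryLocal` names are mine, not part of any Databricks API):

```scala
import scala.annotation.tailrec
import java.io.File

// Local-filesystem analogue of the dbutils traversal above.
// Counts plain files (not directories) under a root, breadth-first.
def countFilesLocal(root: File): Int = {
  @tailrec
  def go(items: List[File], result: Int): Int = items match {
    case head :: tail =>
      val entries = Option(head.listFiles).map(_.toList).getOrElse(Nil)
      val directories = entries.filter(_.isDirectory)
      go(tail ++ directories, result + (entries.size - directories.size))
    case Nil => result
  }
  go(List(root), 0)
}

// Sums the byte sizes of all plain files under a root.
def sizeOfDirectoryLocal(root: File): Long = {
  @tailrec
  def go(items: List[File], result: Long): Long = items match {
    case head :: tail =>
      val entries = Option(head.listFiles).map(_.toList).getOrElse(Nil)
      val directories = entries.filter(_.isDirectory)
      val fileBytes = entries.filterNot(_.isDirectory).map(_.length).sum
      go(tail ++ directories, result + fileBytes)
    case Nil => result
  }
  go(List(root), 0L)
}
```

The traversal logic is identical; only the listing call differs, which is what makes the dbutils versions easy to reason about before running them on a cluster.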

I hope this helps.

Upvotes: 0
