Reputation: 1118
I am trying to connect to an HDFS location from a Databricks notebook to get file details.
Below is the code I tried:
%fs ls dbfs:/mnt/<mount>/dev/data/audit/
I obtained a result with size 0, and only the details of the audit folder itself, not any of its subfolders.
The audit folder has 5 subfolders with files inside them. I want to get the number of files in each subfolder and the total size of those 5 subfolders.
I tried dbutils in Scala, but it doesn't have a function to get the number of files or the size of a file.
Upvotes: 0
Views: 1899
Reputation: 1371
There is no simple method in dbutils that returns the size of a directory or the number of files in it. However, you can compute both by iterating over the directory tree recursively.
1. Number of files (recursive calculation):
import scala.annotation.tailrec
import com.databricks.backend.daemon.dbutils.FileInfo
import com.databricks.dbutils_v1

private lazy val dbutils = dbutils_v1.DBUtilsHolder.dbutils

def numberOfFiles(location: String): Int = {
  @tailrec
  def go(items: List[FileInfo], result: Int): Int = items match {
    case head :: tail =>
      val entries = dbutils.fs.ls(head.path)
      val directories = entries.filter(_.isDir)
      // Count only regular files; directories are queued for
      // traversal instead of being counted as files.
      go(tail ++ directories, result + (entries.size - directories.size))
    case _ => result
  }
  go(dbutils.fs.ls(location).toList, 0)
}
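Called from the same notebook session, it can be used like this (this runs only inside a Databricks notebook; the path below is the asker's example, with the `<mount>` placeholder left for you to fill in):

```scala
// Hypothetical usage; adjust the mount path to your workspace.
val total = numberOfFiles("dbfs:/mnt/<mount>/dev/data/audit/")
println(s"Total files under audit: $total")
```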
2. Total size of a folder:
import scala.annotation.tailrec
import com.databricks.backend.daemon.dbutils.FileInfo

def sizeOfDirectory(location: String): Long = {
  @tailrec
  def go(items: List[FileInfo], result: Long): Long = items match {
    case head :: tail =>
      val entries = dbutils.fs.ls(head.path)
      val directories = entries.filter(_.isDir)
      // Directory entries report size 0, so summing every entry's
      // size yields the total size of the regular files.
      val updated = entries.map(_.size).foldLeft(result)(_ + _)
      go(tail ++ directories, updated)
    case _ => result
  }
  go(dbutils.fs.ls(location).toList, 0L)
}
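With both helpers defined, you can get the per-subfolder breakdown the question asks for, a sketch assuming the asker's example path (Databricks notebook only; `<mount>` is a placeholder to fill in):

```scala
// List the top-level subfolders of audit and report each one's
// file count and total size in bytes.
dbutils.fs.ls("dbfs:/mnt/<mount>/dev/data/audit/")
  .filter(_.isDir)
  .foreach { dir =>
    println(s"${dir.path}: ${numberOfFiles(dir.path)} files, " +
      s"${sizeOfDirectory(dir.path)} bytes")
  }
```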
I hope this helps.
Upvotes: 0