Reputation: 664
I don't know exactly how to start, but in my use case I am trying to get the size of my HDFS directory using Scala. Can someone help here?
I have got as far as this step, but I don't know what I should do from here:
val fi = hdfs.listStatus(new Path("/path/path"))
fi.foreach(x => println(x.getPath))
Upvotes: 2
Views: 7344
Reputation: 379
For the PySpark version, with a Hadoop cluster deployed on Kubernetes (the address is resolved via DNS), you can do the following:
hadoop = spark._jvm.org.apache.hadoop
fs = hadoop.fs.FileSystem
conf = hadoop.conf.Configuration()
conf.set( "fs.defaultFS", "hdfs://hdfs.hdfs:/myhomefolder" )
path = hadoop.fs.Path('/path/')
print(fs.get(conf).getContentSummary(path).getLength())
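This works because spark._jvm exposes the driver's JVM through Py4J, so you are calling the same Hadoop FileSystem/ContentSummary API as in the Scala answers, just from Python.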
Upvotes: 3
Reputation: 451
This will give you the size (disk space, in bytes) of an HDFS directory using Scala and Spark:
import org.apache.hadoop.fs.{FileSystem, Path}
val fs: FileSystem = FileSystem.get(spark.sparkContext.hadoopConfiguration)
fs.getContentSummary(new Path("/path/path")).getLength
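If you also need the physical footprint including replication, the same ContentSummary exposes getSpaceConsumed. A minimal sketch building on the snippet above (it assumes a spark session is already in scope, as in the answer):
import org.apache.hadoop.fs.{FileSystem, Path}

val fs: FileSystem = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val summary = fs.getContentSummary(new Path("/path/path"))

// Logical size of all files under the directory, in bytes.
val logicalBytes: Long = summary.getLength
// Bytes actually consumed on disk, including HDFS replication.
val rawBytes: Long = summary.getSpaceConsumed

println(s"logical: $logicalBytes B, on disk: $rawBytes B")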
Upvotes: 1
Reputation: 1416
This gives you an array of FileStatus instances:
val fi = hdfs.listStatus(new Path("/path/path"))
You can call getBlockSize on each FileStatus.
The following is the documented method in the class:
/**
 * Get the block size of the file.
 * @return the number of bytes
 */
public long getBlockSize() {
  return blocksize;
}
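Note that getBlockSize returns the HDFS block size of the file; if you instead want the total bytes stored under the directory, each FileStatus also exposes getLen. A minimal sketch, assuming hdfs is an already initialised FileSystem as in the question:
import org.apache.hadoop.fs.Path

// Assumes `hdfs` is an initialised org.apache.hadoop.fs.FileSystem.
val statuses = hdfs.listStatus(new Path("/path/path"))

// Sum the length of every entry directly under the directory
// (non-recursive; use getContentSummary for a recursive total).
val totalBytes: Long = statuses.map(_.getLen).sum

println(s"total: $totalBytes bytes")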
Upvotes: 0