sande

Reputation: 664

How to get the size of an HDFS directory using Spark

I don't know exactly where to start, but in my use case I am trying to get the size of my HDFS directory using Scala. Can someone help here?

I have got as far as this step, but I don't know what to do from here:

val fi = hdfs.listStatus(new Path("/path/path"))
fi.foreach(x => println(x.getPath))
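
A minimal sketch of one way to finish that snippet, assuming hdfs is an org.apache.hadoop.fs.FileSystem: it sums the lengths of the entries returned by listStatus. Note that listStatus only lists the immediate children, so files inside nested directories are not counted (the answers below use getContentSummary for the recursive total).

import org.apache.hadoop.fs.Path

// listStatus returns the immediate children only; getLen is each entry's length in bytes
val fi = hdfs.listStatus(new Path("/path/path"))
val totalBytes = fi.filter(_.isFile).map(_.getLen).sum
fi.foreach(x => println(s"${x.getPath} -> ${x.getLen} bytes"))
println(s"total (immediate files only): $totalBytes bytes")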

Upvotes: 2

Views: 7344

Answers (3)

sajjad

Reputation: 379

For the PySpark version, with the Hadoop cluster deployed on Kubernetes (the address is resolved via DNS), you can do the following:


# Access the Hadoop FileSystem API through Spark's JVM gateway
hadoop = spark._jvm.org.apache.hadoop

fs = hadoop.fs.FileSystem
conf = hadoop.conf.Configuration()
conf.set("fs.defaultFS", "hdfs://hdfs.hdfs:/myhomefolder")
path = hadoop.fs.Path('/path/')

# getContentSummary aggregates the whole directory tree; getLength is the total bytes
print(fs.get(conf).getContentSummary(path).getLength())

Upvotes: 3

Samir Vyas

Reputation: 451

This will give you the size (total length of all files, in bytes) of an HDFS directory using Scala with Spark:

import org.apache.hadoop.fs.{FileSystem, Path}

// Reuse the Hadoop configuration already attached to the SparkSession
val fs: FileSystem = FileSystem.get(spark.sparkContext.hadoopConfiguration)

// getContentSummary walks the whole directory tree; getLength is the total bytes
fs.getContentSummary(new Path("/path/path")).getLength
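
If what you actually need is the disk space consumed, including HDFS replication, the same ContentSummary also exposes getSpaceConsumed; a minimal sketch using the same fs:

// Bytes actually used on disk (roughly file length multiplied by the replication factor)
fs.getContentSummary(new Path("/path/path")).getSpaceConsumed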

Upvotes: 1

Constantine

Reputation: 1416

This gives you an array of FileStatus instances:

val fi = hdfs.listStatus(new Path("/path/path"))

You can call getBlockSize on each FileStatus.

The following is the documented method in the class:

/**
 * Get the block size of the file.
 * @return the number of bytes
 */
public long getBlockSize() {
  return blocksize;
}
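
For completeness, a minimal sketch of calling getBlockSize on each entry returned by listStatus. Keep in mind that getBlockSize reports the HDFS block size configured for the file, while FileStatus.getLen holds the file's actual length in bytes:

import org.apache.hadoop.fs.Path

val fi = hdfs.listStatus(new Path("/path/path"))
fi.foreach { status =>
  // getBlockSize: configured HDFS block size for this file (not its actual size)
  // getLen:       actual file length in bytes
  println(s"${status.getPath}  blockSize=${status.getBlockSize}  len=${status.getLen}")
}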

Upvotes: 0
