Reputation: 3062
I know how to find the file size in Scala, but how do I find the size of an RDD/DataFrame in Spark?
Scala:
object Main extends App {
  val file = new java.io.File("hdfs://localhost:9000/samplefile.txt")
  println(file.length) // size of the file in bytes
}
Spark:
val distFile = sc.textFile(file)
println(distFile.length)
but when I process the file this way, I don't get its size. How can I find the size of an RDD?
Upvotes: 51
Views: 137693
Reputation: 29155
Below is one way I use frequently, apart from SizeEstimator.
It answers the questions you often have about an RDD from code: is it cached, and more precisely, how many of its partitions are cached in memory and how many on disk? What is its storage level, its current actual caching status, and its memory consumption?
SparkContext has the developer-API method getRDDStorageInfo(), which you can use for this:
Return information about what RDDs are cached, if they are in mem or on disk, how much space they take, etc.
For example:
scala> sc.getRDDStorageInfo
res3: Array[org.apache.spark.storage.RDDInfo] = Array(RDD "HiveTableScan [name#0], (MetastoreRelation sparkdb, firsttable, None), None " (3) StorageLevel: StorageLevel(false, true, false, true, 1); CachedPartitions: 1; TotalPartitions: 1; MemorySize: 256.0 B; ExternalBlockStoreSize: 0.0 B; DiskSize: 0.0 B)
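If you want to read those values from code rather than from the REPL output, a minimal sketch could look like the following (the field names come from org.apache.spark.storage.RDDInfo; it assumes sc is an existing SparkContext with some persisted RDDs):

import org.apache.spark.storage.RDDInfo

// Print cache status and sizes for every RDD Spark currently tracks.
val infos: Array[RDDInfo] = sc.getRDDStorageInfo
infos.foreach { info =>
  println(s"RDD '${info.name}' (id=${info.id}): " +
    s"${info.numCachedPartitions}/${info.numPartitions} partitions cached, " +
    s"memory=${info.memSize} B, disk=${info.diskSize} B")
}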
Description
With SPARK-13992, Spark supports persisting data in off-heap memory, but off-heap usage is not currently exposed, which makes it inconvenient for users to monitor and profile. The proposal is therefore to expose off-heap as well as on-heap memory usage in various places:
- Spark UI's executor page will display both on-heap and off-heap memory usage.
- REST requests return both on-heap and off-heap memory.
- Both memory usages can also be obtained programmatically from a SparkListener (see the sketch below).
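As a rough illustration of the listener route (my own sketch, not part of the original answer; it reports cached-RDD sizes per stage rather than the on-/off-heap split, and the class name CacheSizeListener is made up):

import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}

// Logs memory/disk usage of cached RDDs referenced by each completed stage.
class CacheSizeListener extends SparkListener {
  override def onStageCompleted(stageCompleted: SparkListenerStageCompleted): Unit = {
    stageCompleted.stageInfo.rddInfos.filter(_.isCached).foreach { info =>
      println(s"Stage ${stageCompleted.stageInfo.stageId}: RDD '${info.name}' " +
        s"memory=${info.memSize} B, disk=${info.diskSize} B")
    }
  }
}

// Register it on the SparkContext:
// sc.addSparkListener(new CacheSizeListener)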
Upvotes: 7
Reputation: 3062
Yes, I finally found the solution. Include these imports:
import org.apache.spark.sql.Row
import org.apache.spark.rdd.RDD
import org.apache.spark.rdd
How to find the RDD Size:
def calcRDDSize(rdd: RDD[String]): Long = {
  rdd.map(_.getBytes("UTF-8").length.toLong)
     .reduce(_ + _) // add the sizes together
}
Function to find the DataFrame size (this just converts the DataFrame to an RDD of strings internally):
// toDF() needs the SQL implicits in scope, e.g. import sqlContext.implicits._
val dataFrame = sc.textFile(args(1)).toDF() // you can replace args(1) with any path
val rddOfDataframe = dataFrame.rdd.map(_.toString())
val size = calcRDDSize(rddOfDataframe)
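Note that this measures the UTF-8 byte length of each Row's toString representation, so it only approximates the size of the underlying file. If you want the size of the raw text itself, a small variation (my own sketch, reusing calcRDDSize before the DataFrame conversion) would be:

// Size of the raw text lines, skipping the Row -> String conversion above.
val rawTextSize = calcRDDSize(sc.textFile(args(1)))
println(s"Approximate raw text size: $rawTextSize bytes")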
Upvotes: 14
Reputation: 13154
If you are simply looking to count the number of rows in the RDD, do:
val distFile = sc.textFile(file)
println(distFile.count)
If you are interested in the bytes, you can use the SizeEstimator:
import org.apache.spark.util.SizeEstimator
println(SizeEstimator.estimate(distFile))
https://spark.apache.org/docs/latest/api/java/org/apache/spark/util/SizeEstimator.html
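One caveat: calling SizeEstimator.estimate on the RDD reference itself measures the driver-side RDD object, not the distributed data it represents. A rough sketch for estimating the materialized records instead (my own variation, not from the original answer) sizes each partition's contents on the executors and sums the results:

import org.apache.spark.util.SizeEstimator

// Materialize each partition into an array, estimate its size, and sum.
val estimatedBytes = distFile
  .mapPartitions(iter => Iterator(SizeEstimator.estimate(iter.toArray)))
  .reduce(_ + _)
println(s"Estimated in-memory size: $estimatedBytes bytes")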
Upvotes: 78