Reputation: 522
I have a dataframe with 1600 partitions.
Upvotes: 1
Views: 431
Reputation: 3733
I don't know if there is an easy way to get the exact size in bytes per partition at runtime, but if you want to know this in order to find skew, you can easily get the number of records in each partition with something like this (it's Scala):
df.mapPartitions(it => Iterator(it.size)).show
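If you also want to see which partition each count belongs to, a slightly different sketch (not part of the snippet above; it uses the built-in spark_partition_id function and a hypothetical DataFrame df) groups the rows by partition index:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.spark_partition_id

val spark = SparkSession.builder()
  .appName("partition-record-counts")
  .master("local[*]")
  .getOrCreate()

// Hypothetical example data; replace with your own DataFrame.
val df = spark.range(0, 1000000).repartition(16).toDF("id")

// One row per partition: the partition index and its record count,
// which makes skewed partitions easy to spot.
df.groupBy(spark_partition_id().as("partition_id"))
  .count()
  .orderBy("partition_id")
  .show(200, truncate = false)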
If your dataset is cached, you can get its size in bytes in your code from the statistics. Remember that the dataset needs to be cached, and there needs to be some action between caching and reading the statistics. If you need a sample action, you can use something like this: input.cache.foreach(_ => ())
val bytes = spark
  .sessionState
  .executePlan(repartitioned.queryExecution.logical)
  .optimizedPlan
  .stats
  .sizeInBytes
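Putting the caching, the sample action, and the statistics read together, a minimal end-to-end sketch (Spark 2.x API; the DataFrame name input and the example data are just placeholders) could look like this:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cached-size-in-bytes")
  .master("local[*]")
  .getOrCreate()

// Hypothetical input; replace with your own DataFrame.
val input = spark.range(0, 1000000).toDF("id")

// Cache and run a cheap action so the statistics reflect the cached data.
input.cache.foreach(_ => ())

// Estimated size in bytes read from the optimized plan's statistics.
val bytes = spark
  .sessionState
  .executePlan(input.queryExecution.logical)
  .optimizedPlan
  .stats
  .sizeInBytes

println(s"Estimated size: $bytes bytes")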
The same works when you load your data from a file, for example CSV or JSON (in this case Spark creates the statistics "for free" during the load), or when you read from an input with compatible and accurate statistics (for example a Hive table).
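For example, after a plain file load the statistics already hold an estimate derived from the file size on disk, with no caching step needed. This sketch reads the same statistics through the Dataset's own queryExecution; the path /tmp/data.csv is hypothetical:

// Read a file-based source; Spark already knows its size on disk.
val fromCsv = spark.read
  .option("header", "true")
  .csv("/tmp/data.csv")  // hypothetical path

val csvBytes = fromCsv.queryExecution.optimizedPlan.stats.sizeInBytes
println(s"Estimated input size: $csvBytes bytes")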
Other options are available via the Spark UI. The first, as you mentioned, is simply to cache the dataset and get its size from the Storage tab. The other is to check the input or shuffle write/read sizes for the stage you are interested in; that can show you whether you have skew.
Here you can see a very clear example: a list of tasks (1 task = 1 partition) for the stage on which I did foreach(_ => ()) on my dataset. It shows pretty well how the data is organised within partitions, and you can see both the size and the number of records.
Upvotes: 1