Bob

Reputation: 661

How to find the size of a dataframe in pyspark

How can I replicate this code to get the dataframe size in pyspark?

scala> val df = spark.range(10)
scala> print(spark.sessionState.executePlan(df.queryExecution.logical).optimizedPlan.stats)
Statistics(sizeInBytes=80.0 B, hints=none)

What I would like to do is get the sizeInBytes value into a variable.

Upvotes: 5

Views: 9549

Answers (4)

sakjung

Reputation: 11

You can use RepartiPy to get the accurate size of your DataFrame as follows:

import repartipy

# Use this if you have enough (executor) memory to cache the whole DataFrame.
# If you do NOT have enough memory (i.e. the DataFrame is too large), use 'repartipy.SamplingSizeEstimator' instead.
with repartipy.SizeEstimator(spark=spark, df=df) as se:
    df_size_in_bytes = se.estimate()

RepartiPy leverages the executePlan method internally, as you already mentioned, to calculate the in-memory size of your DataFrame.

Please see the docs for more details.
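
If the whole DataFrame does not fit in executor memory, the sampling-based estimator mentioned in the comment above can be used instead. A minimal sketch, assuming the constructor takes a sample_count argument as shown in the RepartiPy docs (check your installed version for the exact signature):

import repartipy

# Sampling-based estimation for DataFrames too large to cache;
# 'sample_count' is assumed from the RepartiPy docs and may differ in your version.
with repartipy.SamplingSizeEstimator(spark=spark, df=df, sample_count=10) as se:
    df_size_in_bytes = se.estimate()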

Upvotes: 0

Kyle Osborne

Reputation: 61

Typically, you can access the Scala methods through py4j. I just tried this in the PySpark shell:

>>> spark._jsparkSession.sessionState().executePlan(df._jdf.queryExecution().logical()).optimizedPlan().stats().sizeInBytes()
716
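
If you want that value in a Python variable, a small hedged helper along these lines should work. Note that _jsparkSession and _jdf are private internals and the executePlan signature has changed across Spark versions, so treat this as a sketch rather than a stable API:

def optimized_plan_size_in_bytes(spark, df):
    # Same py4j call chain as above; the returned Scala BigInt is converted
    # to a Python int via its string representation.
    plan = spark._jsparkSession.sessionState() \
        .executePlan(df._jdf.queryExecution().logical()) \
        .optimizedPlan()
    return int(plan.stats().sizeInBytes().toString())

size = optimized_plan_size_in_bytes(spark, df)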

Upvotes: 3

Som

Reputation: 6338

See if this helps -

Read the JSON file source and compute stats such as size in bytes, number of rows, etc. These statistics also help Spark make intelligent decisions while optimizing the execution plan. The code below is Scala, but the same approach works in PySpark too (a PySpark sketch follows after the Scala snippet).

/**
      * file content
      * spark-test-data.json
      * --------------------
      * {"id":1,"name":"abc1"}
      * {"id":2,"name":"abc2"}
      * {"id":3,"name":"abc3"}
      */
    val fileName = "spark-test-data.json"
    val path = getClass.getResource("/" + fileName).getPath

    spark.catalog.createTable("df", path, "json")
      .show(false)

    /**
      * +---+----+
      * |id |name|
      * +---+----+
      * |1  |abc1|
      * |2  |abc2|
      * |3  |abc3|
      * +---+----+
      */
    // Collect only statistics that do not require scanning the whole table (that is, size in bytes).
    spark.sql("ANALYZE TABLE df COMPUTE STATISTICS NOSCAN")
    spark.sql("DESCRIBE EXTENDED df ").filter(col("col_name") === "Statistics").show(false)

    /**
      * +----------+---------+-------+
      * |col_name  |data_type|comment|
      * +----------+---------+-------+
      * |Statistics|68 bytes |       |
      * +----------+---------+-------+
      */
    spark.sql("ANALYZE TABLE df COMPUTE STATISTICS")
    spark.sql("DESCRIBE EXTENDED df ").filter(col("col_name") === "Statistics").show(false)

    /**
      * +----------+----------------+-------+
      * |col_name  |data_type       |comment|
      * +----------+----------------+-------+
      * |Statistics|68 bytes, 3 rows|       |
      * +----------+----------------+-------+
      */
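
For reference, a hedged PySpark version of the same flow could look like this (assuming the same spark-test-data.json file; adjust the path as needed):

from pyspark.sql.functions import col

# Register the JSON file as a table, as in the Scala snippet above.
path = "spark-test-data.json"
spark.catalog.createTable("df", path=path, source="json").show(truncate=False)

# NOSCAN collects only statistics that do not require scanning the whole table (size in bytes).
spark.sql("ANALYZE TABLE df COMPUTE STATISTICS NOSCAN")
spark.sql("DESCRIBE EXTENDED df").filter(col("col_name") == "Statistics").show(truncate=False)

# A full ANALYZE additionally computes the row count.
spark.sql("ANALYZE TABLE df COMPUTE STATISTICS")
spark.sql("DESCRIBE EXTENDED df").filter(col("col_name") == "Statistics").show(truncate=False)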

More info: Databricks SQL docs.

Upvotes: 3

David Vrba

Reputation: 3344

In Spark 2.4 you can do:

df = spark.range(10)
df.createOrReplaceTempView('myView')
spark.sql('explain cost select * from myView').show(truncate=False)

== Optimized Logical Plan ==
Range (0, 10, step=1, splits=Some(8)), Statistics(sizeInBytes=80.0 B, hints=none)

In Spark 3.0.0-preview2 you can use explain with the cost mode:

df = spark.range(10)
df.explain(mode='cost')

== Optimized Logical Plan ==
Range (0, 10, step=1, splits=Some(8)), Statistics(sizeInBytes=80.0 B)
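
To capture sizeInBytes in a variable rather than just printing the plan, one option (a hedged sketch, not part of the original answer) is to collect the plan text and parse the value out:

import re

df = spark.range(10)
df.createOrReplaceTempView('myView')

# EXPLAIN returns a single-row DataFrame with one 'plan' column; grab its text.
plan_text = spark.sql('explain cost select * from myView').collect()[0]['plan']

# Size units differ across Spark versions (B, KiB/MiB or KB/MB), so match both styles.
match = re.search(r'sizeInBytes=([0-9.E+]+)\s*(B|KiB|MiB|GiB|TiB|KB|MB|GB|TB)', plan_text)
units = {'B': 1, 'KiB': 1024, 'MiB': 1024**2, 'GiB': 1024**3, 'TiB': 1024**4,
         'KB': 1024, 'MB': 1024**2, 'GB': 1024**3, 'TB': 1024**4}
size_in_bytes = float(match.group(1)) * units[match.group(2)] if match else None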

Upvotes: 17
