Neo

Reputation: 4478

How to find pyspark dataframe memory usage?

For a pandas DataFrame, the info() function provides memory usage. Is there any equivalent in PySpark? Thanks

Upvotes: 21

Views: 38650

Answers (7)

sakjung

Reputation: 11

You can use RepartiPy to get an accurate size of your DataFrame as follows:

import repartipy

# Use this if you have enough (executor) memory to cache the whole DataFrame
# If you do NOT have enough memory (i.e. the DataFrame is too large), use 'repartipy.SamplingSizeEstimator' instead.
with repartipy.SizeEstimator(spark=spark, df=df) as se:
    df_size_in_bytes = se.estimate()

RepartiPy leverages a caching approach internally to calculate the in-memory size of your DataFrame.

Please see the docs for more details.

Upvotes: 0

Antonio

Reputation: 61

For the dataframe df you can do this:

sc._jvm.org.apache.spark.util.SizeEstimator.estimate(df._jdf)

Upvotes: 1

figs_and_nuts

Reputation: 5771

As per the documentation:

The best way to size the amount of memory consumption a dataset will require is to create an RDD, put it into cache, and look at the “Storage” page in the web UI. The page will tell you how much memory the RDD is occupying.

To estimate the memory consumption of a particular object, use SizeEstimator’s estimate method. This is useful for experimenting with different data layouts to trim memory usage, as well as determining the amount of space a broadcast variable will occupy on each executor heap.
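
A minimal sketch of the caching approach from PySpark, assuming an active SparkSession named spark and a DataFrame df; getRDDStorageInfo() is reached through the JVM gateway (an internal handle, so treat it as an assumption) and mirrors what the "Storage" page shows:

# Cache the data and trigger an action so it is actually materialized.
df.cache()
df.count()

# Read the cached sizes programmatically (same numbers as the "Storage" page).
for rdd_info in spark.sparkContext._jsc.sc().getRDDStorageInfo():
    print(rdd_info.name(), rdd_info.memSize(), "bytes in memory")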

Upvotes: 1

Vipin Chaudhary

Reputation: 444

I have something in mind; it's just a rough estimation. As far as I know, Spark doesn't have a straightforward way to get DataFrame memory usage, but a pandas DataFrame does. So what you can do is (a sketch follows the list):

  1. Select a 1% sample of the data: sample = df.sample(fraction = 0.01)
  2. pdf = sample.toPandas()
  3. Get the pandas DataFrame memory usage with pdf.info()
  4. Multiply that value by 100; this should give a rough estimate of your whole Spark DataFrame's memory usage.
  5. Correct me if I am wrong :|
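
A minimal sketch of the steps above, assuming an existing SparkSession and a DataFrame df; memory_usage(deep=True) reports the same bytes that info() prints:

# Rough estimate: sample 1% of the data, measure it in pandas, and scale up.
fraction = 0.01

sample_pdf = df.sample(fraction=fraction).toPandas()

# deep=True counts object (e.g. string) payloads, matching what info() reports
sample_bytes = sample_pdf.memory_usage(deep=True).sum()
estimated_bytes = sample_bytes / fraction  # scale the 1% sample back up to 100%

print(f"Estimated memory usage (pandas-side): {estimated_bytes / 1024 ** 2:.1f} MB")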

Upvotes: 20

MaxU - stand with Ukraine

Reputation: 210972

Try to use the _to_java_object_rdd() function:

from pyspark.serializers import PickleSerializer, AutoBatchedSerializer

# df is the DataFrame whose size you'd like to estimate

# Helper function to convert python object to Java objects
def _to_java_object_rdd(rdd):  
    """ Return a JavaRDD of Object by unpickling
    It will convert each Python object into Java object by Pyrolite, whenever the
    RDD is serialized in batch or not.
    """
    rdd = rdd._reserialize(AutoBatchedSerializer(PickleSerializer()))
    return rdd.ctx._jvm.org.apache.spark.mllib.api.python.SerDe.pythonToJava(rdd._jrdd, True)

# First convert the DataFrame's underlying RDD to a JavaRDD of Java objects
JavaObj = _to_java_object_rdd(df.rdd)

# Now we can run the estimator
sc._jvm.org.apache.spark.util.SizeEstimator.estimate(JavaObj)

Upvotes: 20

Victor Z

Reputation: 823

How about the below? The reported size of the cached sample (in KB), multiplied by 100, gives the estimated real size.

df.sample(fraction = 0.01).cache().count()

Upvotes: -5

vikrant rana

Reputation: 4701

You can persist the DataFrame in memory and take an action such as df.count(). You will then be able to check its size under the Storage tab of the Spark web UI. Let me know if it works for you.
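
A minimal sketch of the above, assuming an active SparkSession and a DataFrame df; the web UI address is an assumption (port 4040 is only the default):

# Persist (MEMORY_AND_DISK by default for DataFrames) and trigger an action
# so the data actually gets cached.
df.persist()
df.count()

# Now open the Spark web UI (typically http://<driver-host>:4040) and check
# the "Storage" tab for the in-memory size of the cached DataFrame.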

Upvotes: 0
