Reputation: 4478
For a pandas DataFrame, the info() function provides memory usage. Is there an equivalent in PySpark? Thanks
Upvotes: 21
Views: 38650
Reputation: 11
You can use RepartiPy to get an accurate size of your DataFrame, as follows:
import repartipy

# Use this if you have enough (executor) memory to cache the whole DataFrame.
# If you do NOT have enough memory (i.e. the DataFrame is too large), use 'repartipy.SamplingSizeEstimator' instead.
with repartipy.SizeEstimator(spark=spark, df=df) as se:
    df_size_in_bytes = se.estimate()
RepartiPy leverages a caching approach internally in order to calculate the in-memory size of your DataFrame.
Please see the docs for more details.
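If the DataFrame is too large to cache in full, a sampling-based variant along the following lines may work. This is only a sketch: the sample_count parameter is my assumption, so check the RepartiPy docs for the exact constructor arguments.
import repartipy

# Sketch only: assumes SamplingSizeEstimator accepts a sample_count argument;
# verify against the RepartiPy documentation for your version.
with repartipy.SamplingSizeEstimator(spark=spark, df=df, sample_count=10) as se:
    df_size_in_bytes = se.estimate()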
Upvotes: 0
Reputation: 61
For a DataFrame df you can do this:
sc._jvm.org.apache.spark.util.SizeEstimator.estimate(df._jdf)
Upvotes: 1
Reputation: 5771
As per the documentation:
The best way to size the amount of memory consumption a dataset will require is to create an RDD, put it into cache, and look at the “Storage” page in the web UI. The page will tell you how much memory the RDD is occupying.
To estimate the memory consumption of a particular object, use SizeEstimator’s estimate method. This is useful for experimenting with different data layouts to trim memory usage, as well as determining the amount of space a broadcast variable will occupy on each executor heap.
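In PySpark, the caching approach described above might look roughly like this (a minimal sketch; the actual size is then read off the "Storage" page of the web UI):
from pyspark.storagelevel import StorageLevel

# Cache the DataFrame and force materialization with an action, then open
# the "Storage" tab of the Spark web UI to see how much memory it occupies.
df.persist(StorageLevel.MEMORY_ONLY)
df.count()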
Upvotes: 1
Reputation: 444
I have something in mind; it's just a rough estimation. As far as I know, Spark doesn't have a straightforward way to get a DataFrame's memory usage, but a pandas DataFrame does. So what you can do is:
sample = df.sample(fraction=0.01)  # take a 1% sample
pdf = sample.toPandas()            # convert the small sample to pandas
pdf.info()                         # pandas reports the sample's memory usage
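To turn the sample's footprint into an estimate for the whole DataFrame, you can scale it by the inverse of the sampling fraction. A rough sketch (memory_usage(deep=True) is standard pandas; the factor of 100 simply undoes the 1% sample):
fraction = 0.01
sample_bytes = pdf.memory_usage(deep=True).sum()
estimated_total_bytes = sample_bytes / fraction  # scale the 1% sample back up
print(f"~{estimated_total_bytes / 1024 ** 2:.1f} MB estimated in-memory size")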
Upvotes: 20
Reputation: 210972
Try to use the _to_java_object_rdd() function:
import py4j.protocol
from py4j.protocol import Py4JJavaError
from py4j.java_gateway import JavaObject
from py4j.java_collections import JavaArray, JavaList
from pyspark import RDD, SparkContext
from pyspark.serializers import PickleSerializer, AutoBatchedSerializer
# df is the DataFrame whose size you'd like to estimate
df

# Helper function to convert a Python RDD into an RDD of Java objects
def _to_java_object_rdd(rdd):
    """Return a JavaRDD of Object by unpickling.

    It will convert each Python object into a Java object via Pyrolite,
    whether or not the RDD is serialized in batches.
    """
    rdd = rdd._reserialize(AutoBatchedSerializer(PickleSerializer()))
    return rdd.ctx._jvm.org.apache.spark.mllib.api.python.SerDe.pythonToJava(rdd._jrdd, True)
# First you have to convert it to an RDD
JavaObj = _to_java_object_rdd(df.rdd)
# Now we can run the estimator
sc._jvm.org.apache.spark.util.SizeEstimator.estimate(JavaObj)
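SizeEstimator.estimate returns a size in bytes, so a small usage example to make the number readable:
size_bytes = sc._jvm.org.apache.spark.util.SizeEstimator.estimate(JavaObj)
print(f"Estimated size: {size_bytes / 1024 ** 2:.1f} MB")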
Upvotes: 20
Reputation: 823
How about the below? Cache a 1% sample and trigger it with count(), then check the cached size (e.g. in the Storage tab of the Spark web UI) and multiply by 100 to get the estimated real size.
df.sample(fraction=0.01).cache().count()
Upvotes: -5
Reputation: 4701
You can persist the DataFrame in memory and trigger an action such as df.count(). You will then be able to check its size under the Storage tab of the Spark web UI. Let me know if it works for you.
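If you'd rather read that number programmatically instead of from the UI, something along these lines should work. This is only a sketch going through the py4j gateway; the getRDDStorageInfo / memSize accessors are my assumption, so verify them against your Spark version:
df.persist()
df.count()  # materialize the cache

# Ask the JVM SparkContext which RDDs are cached and how much memory they use
# (assumption: getRDDStorageInfo and memSize are available in your Spark version).
for info in spark.sparkContext._jsc.sc().getRDDStorageInfo():
    print(info.name(), info.memSize(), "bytes in memory")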
Upvotes: 0