Tokyo

Reputation: 823

Describe a Dataframe on PySpark

I have a fairly large Parquet file which I am loading using

file = spark.read.parquet('hdfs/directory/test.parquet')

Now I want to get some statistics (similar to the pandas describe() function). What I tried was:

file_pd = file.toPandas()
file_pd.describe()

but this obviously requires loading all the data into memory, and it fails. Can anyone suggest a workaround?

Upvotes: 4

Views: 32760

Answers (3)

gustavolq

Reputation: 428

In Spark, you can use df.describe() or df.summary() to get statistical information.

The difference is that df.summary() returns the same information as df.describe() plus quartile information (25%, 50% and 75%).

If you want to exclude string columns, you can use a list comprehension over df.dtypes, which returns ('column_name', 'column_type') tuples, to keep only the non-string columns and pass them to df.select().

Command example:

df.select([col[0] for col in df.dtypes if col[1] != 'string']).describe().show()
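
As a minimal sketch of the describe()/summary() difference mentioned above (assuming an existing DataFrame df with numeric columns), summary() also accepts the statistics to compute, so you can request only the quartiles that describe() does not report:

# Compute only selected statistics, including the approximate quartiles
# (df is assumed to already exist)
df.summary("count", "mean", "25%", "50%", "75%").show()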

Upvotes: 2

Hari_pb
Hari_pb

Reputation: 7406

Even though it's not exactly what the question asks, similar to the Hive or SQL DESCRIBE statement for inspecting data types, you can simply do

df.printSchema()

This prints the schema of the DataFrame, including each column's name and data type.
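
For example, on a small hypothetical DataFrame (assuming spark is an active SparkSession), the output looks roughly like this:

# Hypothetical two-column DataFrame, just to illustrate the output format
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
df.printSchema()
# root
#  |-- id: long (nullable = true)
#  |-- label: string (nullable = true)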

Upvotes: 0

ollik1

Reputation: 4540

Which statistics do you need? Spark has a similar built-in feature:

file.summary().show()
+-------+----+
|summary|test|
+-------+----+
|  count|   3|
|   mean| 2.0|
| stddev| 1.0|
|    min|   1|
|    25%|   1|
|    50%|   2|
|    75%|   3|
|    max|   3|
+-------+----+
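
For reference, a minimal sketch that reproduces output like the above (the single-column DataFrame is an assumption, since the original data is not shown):

# Hypothetical stand-in for the Parquet data: a single numeric column
file = spark.createDataFrame([(1,), (2,), (3,)], ["test"])
file.summary().show()

Unlike toPandas(), summary() computes the statistics in Spark and only returns a small result DataFrame, so the full dataset is never collected to the driver.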

Upvotes: 18
