Reputation: 629
Is there an equivalent to the pandas info() method in PySpark?
I am trying to gain basic statistics about a dataframe in PySpark, such as:
- number of columns and rows
- number of nulls
- size of the dataframe

The info() method in pandas provides all of these statistics.
Upvotes: 13
Views: 27208
Reputation: 1376
I wrote a pyspark function that emulates Pandas.DataFrame.info()
from collections import Counter

def spark_info(df, abbreviate_columns=True, include_nested_types=False, count=None):
    """Similar to Pandas.DataFrame.info which produces output like:

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 201100 entries, 0 to 201099
    Columns: 151 entries, first_col to last_col
    dtypes: float64(20), int64(6), object(50)
    memory usage: 231.7+ MB
    """
    classinfo = "<class 'pyspark.sql.dataframe.DataFrame'>"
    # Counting rows is expensive on big dataframes; allow passing a
    # precomputed count instead.
    _cnt = count if count is not None else df.count()
    numrows = f"Total Rows: {_cnt}"
    _cols = (
        ' to '.join([df.columns[0], df.columns[-1]])
        if abbreviate_columns
        else ', '.join(df.columns))
    columns = f"{len(df.columns)} entries: {_cols}"
    # Skip nested types (arrays, structs, maps) unless asked for, since
    # their string representations can be very long.
    _typs = [
        field.dataType
        for field in df.schema
        if include_nested_types or (
            'ArrayType' not in str(field.dataType) and
            'StructType' not in str(field.dataType) and
            'MapType' not in str(field.dataType))
    ]
    dtypes = ', '.join(
        f"{str(typ)}({cnt})"
        for typ, cnt in Counter(_typs).items())
    mem = 'memory usage: ? bytes'
    return '\n'.join([classinfo, numrows, columns, dtypes, mem])
I wasn't sure about estimating the size of a pyspark dataframe. It depends on the full spark execution plan and configuration, but maybe try this answer for ideas; one often-suggested sketch is below.
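This reads the Catalyst optimizer's size estimate through Spark's internal JVM handle. df._jdf is a private attribute and the stats() signature has changed across Spark versions, so treat this as a fragile sketch, not a reliable API:
# Fragile sketch: df._jdf is Spark's *internal* JVM DataFrame handle and may
# change between versions; stats().sizeInBytes() is the optimizer's estimate.
size_estimate = df._jdf.queryExecution().optimizedPlan().stats().sizeInBytes()
print(size_estimate.toString() + ' bytes (estimated)')  # scala BigInt via py4j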
Note that not all dtype summaries are included; by default, nested types are excluded. Also, df.count() is calculated, which can take a while, unless you compute it beforehand and pass it in via the count argument.
Suggested usage:
>>> df = spark.createDataFrame([(1, 'a', 2), (2, 'b', 3)], ['id', 'letter', 'num'])
>>> print(spark_info(df, count=2))
<class 'pyspark.sql.dataframe.DataFrame'>
Total Rows: 2
3 entries: id to num
LongType(2), StringType(1)
memory usage: ? bytes
Upvotes: 3
Reputation: 126
There is also the summary method, which returns the row count and some other descriptive statistics. It is similar to the describe method already mentioned.
From the PySpark manual:
df.summary().show()
+-------+------------------+-----+
|summary|               age| name|
+-------+------------------+-----+
|  count|                 2|    2|
|   mean|               3.5| null|
| stddev|2.1213203435596424| null|
|    min|                 2|Alice|
|    25%|                 2| null|
|    50%|                 2| null|
|    75%|                 5| null|
|    max|                 5|  Bob|
+-------+------------------+-----+
or
df.select("age", "name").summary("count").show()
+-------+---+----+
|summary|age|name|
+-------+---+----+
|  count|  2|   2|
+-------+---+----+
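summary() also accepts the statistics you want by name (including arbitrary percentiles such as "33%"), so you can request exactly what you need. For example, using the same df as above:
df.select("age").summary("min", "25%", "75%", "max").show()
# +-------+---+
# |summary|age|
# +-------+---+
# |    min|  2|
# |    25%|  2|
# |    75%|  5|
# |    max|  5|
# +-------+---+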
Upvotes: 8
Reputation: 529
To figure out the type information of a data frame, you could try df.schema:
spark.read.csv('matchCount.csv', header=True).schema
StructType(List(StructField(categ,StringType,true),StructField(minv,StringType,true),StructField(maxv,StringType,true),StructField(counts,StringType,true),StructField(cutoff,StringType,true)))
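printSchema() renders the same schema as an indented tree, which is usually easier to read; for the file above it would print something like:
spark.read.csv('matchCount.csv', header=True).printSchema()
# root
#  |-- categ: string (nullable = true)
#  |-- minv: string (nullable = true)
#  |-- maxv: string (nullable = true)
#  |-- counts: string (nullable = true)
#  |-- cutoff: string (nullable = true)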
For summary stats, you could also have a look at the describe method in the documentation.
Upvotes: 5
Reputation: 71
Check this answer to get a count of the null and non-null values.
from pyspark.sql.functions import isnan, when, count, col
import numpy as np
df = spark.createDataFrame(
    [(1, 1, None), (1, 2, float(5)), (1, 3, np.nan), (1, 4, None),
     (1, 5, float(10)), (1, 6, float('nan')), (1, 6, float('nan'))],
    ('session', "timestamp1", "id2"))
df.show()
# +-------+----------+----+
# |session|timestamp1| id2|
# +-------+----------+----+
# |      1|         1|null|
# |      1|         2| 5.0|
# |      1|         3| NaN|
# |      1|         4|null|
# |      1|         5|10.0|
# |      1|         6| NaN|
# |      1|         6| NaN|
# +-------+----------+----+
df.select([count(when(isnan(c), c)).alias(c) for c in df.columns]).show()
# +-------+----------+---+
# |session|timestamp1|id2|
# +-------+----------+---+
# |      0|         0|  3|
# +-------+----------+---+
df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).show()
# +-------+----------+---+
# |session|timestamp1|id2|
# +-------+----------+---+
# |      0|         0|  5|
# +-------+----------+---+
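Note that isnan() only makes sense for floating-point columns. If you care only about nulls, whatever the column type, drop the isnan check; a minimal variant using the same df:
# Nulls only, ignoring NaN -- id2 has two literal Nones in the data above.
df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns]).show()
# +-------+----------+---+
# |session|timestamp1|id2|
# +-------+----------+---+
# |      0|         0|  2|
# +-------+----------+---+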
df.describe().show()
# +-------+-------+------------------+---+
# |summary|session|        timestamp1|id2|
# +-------+-------+------------------+---+
# |  count|      7|                 7|  5|
# |   mean|    1.0| 3.857142857142857|NaN|
# | stddev|    0.0|1.9518001458970662|NaN|
# |    min|      1|                 1|5.0|
# |    max|      1|                 6|NaN|
# +-------+-------+------------------+---+
There is no equivalent to pandas.DataFrame.info() that I know of. printSchema is useful, and toPandas().info() works for small dataframes, but when I use pandas.DataFrame.info() it is often the null values I am looking at.
Upvotes: 3
Reputation: 5565
I could not find a good answer, so I use the slightly cheating:
dataFrame.toPandas().info()
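Be aware that toPandas() collects the entire dataframe onto the driver, so for large data it may help to cap the rows first, with the caveat that info() then describes only that slice:
# Sketch: convert only the first N rows so the driver isn't overwhelmed;
# the dtypes and non-null counts then reflect the sample, not the full data.
dataFrame.limit(10000).toPandas().info()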
Upvotes: 4