Reputation: 629
Is there an equivalent to the pandas info() method in PySpark?
I am trying to gain basic statistics about a dataframe in PySpark, such as:
- number of columns and rows
- number of nulls
- size of the dataframe

The info() method in pandas provides all of these statistics.
Upvotes: 13
Views: 27208
Reputation: 1376
I wrote a pyspark function that emulates Pandas.DataFrame.info()
from collections import Counter

def spark_info(df, abbreviate_columns=True, include_nested_types=False, count=None):
    """Similar to Pandas.DataFrame.info which produces output like:

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 201100 entries, 0 to 201099
    Columns: 151 entries, first_col to last_col
    dtypes: float64(20), int64(6), object(50)
    memory usage: 231.7+ MB
    """
    classinfo = "<class 'pyspark.sql.dataframe.DataFrame'>"
    # Counting rows is expensive on big dataframes; allow passing a
    # precomputed count instead.
    _cnt = count if count is not None else df.count()
    numrows = f"Total Rows: {_cnt}"
    _cols = (
        ' to '.join([df.columns[0], df.columns[-1]])
        if abbreviate_columns
        else ', '.join(df.columns))
    columns = f"{len(df.columns)} entries: {_cols}"
    # Skip nested types (arrays, structs, maps) unless asked for, since
    # their string representations can be very long.
    _typs = [
        field.dataType
        for field in df.schema
        if include_nested_types or (
            'ArrayType' not in str(field.dataType) and
            'StructType' not in str(field.dataType) and
            'MapType' not in str(field.dataType))
    ]
    dtypes = ', '.join(
        f"{str(typ)}({cnt})"
        for typ, cnt in Counter(_typs).items())
    mem = 'memory usage: ? bytes'
    return '\n'.join([classinfo, numrows, columns, dtypes, mem])
I wasn't sure about estimating the size of a pyspark dataframe. It depends on the full spark execution plan and configuration, but maybe try this answer for ideas; one often-suggested sketch is below.
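This reads the Catalyst optimizer's size estimate through Spark's internal JVM handle. df._jdf is a private attribute and the stats() signature has changed across Spark versions, so treat this as a fragile sketch, not a reliable API:
# Fragile sketch: df._jdf is Spark's *internal* JVM DataFrame handle and may
# change between versions; stats().sizeInBytes() is the optimizer's estimate.
size_estimate = df._jdf.queryExecution().optimizedPlan().stats().sizeInBytes()
print(size_estimate.toString() + ' bytes (estimated)')  # scala BigInt via py4j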
Note that not all dtype summaries are included; by default, nested types are excluded. Also, df.count() is calculated, which can take a while, unless you compute it beforehand and pass it in via the count argument.
Suggested usage:
>>> df = spark.createDataFrame([(1, 'a', 2), (2, 'b', 3)], ['id', 'letter', 'num'])
>>> print(spark_info(df, count=2))
<class 'pyspark.sql.dataframe.DataFrame'>
Total Rows: 2
3 entries: id to num
LongType(2), StringType(1)
memory usage: ? bytes
Upvotes: 3
Reputation: 126
There is also the summary method, which returns the row count and some other descriptive statistics. It is similar to the describe method already mentioned.
From the PySpark manual:
df.summary().show()
+-------+------------------+-----+
|summary|               age| name|
+-------+------------------+-----+
|  count|                 2|    2|
|   mean|               3.5| null|
| stddev|2.1213203435596424| null|
|    min|                 2|Alice|
|    25%|                 2| null|
|    50%|                 2| null|
|    75%|                 5| null|
|    max|                 5|  Bob|
+-------+------------------+-----+
or
df.select("age", "name").summary("count").show()
+-------+---+----+
|summary|age|name|
+-------+---+----+
|  count|  2|   2|
+-------+---+----+
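summary() also accepts the statistics you want by name (including arbitrary percentiles such as "33%"), so you can request exactly what you need. For example, using the same df as above:
df.select("age").summary("min", "25%", "75%", "max").show()
# +-------+---+
# |summary|age|
# +-------+---+
# |    min|  2|
# |    25%|  2|
# |    75%|  5|
# |    max|  5|
# +-------+---+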
Upvotes: 8
Reputation: 529
To figure out the type information of a data frame, you could try df.schema:
spark.read.csv('matchCount.csv', header=True).schema
StructType(List(StructField(categ,StringType,true),StructField(minv,StringType,true),StructField(maxv,StringType,true),StructField(counts,StringType,true),StructField(cutoff,StringType,true)))
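printSchema() renders the same schema as an indented tree, which is usually easier to read; for the file above it would print something like:
spark.read.csv('matchCount.csv', header=True).printSchema()
# root
#  |-- categ: string (nullable = true)
#  |-- minv: string (nullable = true)
#  |-- maxv: string (nullable = true)
#  |-- counts: string (nullable = true)
#  |-- cutoff: string (nullable = true)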
For summary stats, you could also have a look at the describe method in the documentation.
Upvotes: 5
Reputation: 71
Check this answer to get a count of the null and non-null values.
from pyspark.sql.functions import isnan, when, count, col
import numpy as np
df = spark.createDataFrame(
    [(1, 1, None), (1, 2, float(5)), (1, 3, np.nan), (1, 4, None),
     (1, 5, float(10)), (1, 6, float('nan')), (1, 6, float('nan'))],
    ('session', "timestamp1", "id2"))
df.show()
# +-------+----------+----+
# |session|timestamp1| id2|
# +-------+----------+----+
# |      1|         1|null|
# |      1|         2| 5.0|
# |      1|         3| NaN|
# |      1|         4|null|
# |      1|         5|10.0|
# |      1|         6| NaN|
# |      1|         6| NaN|
# +-------+----------+----+
df.select([count(when(isnan(c), c)).alias(c) for c in df.columns]).show()
# +-------+----------+---+
# |session|timestamp1|id2|
# +-------+----------+---+
# |      0|         0|  3|
# +-------+----------+---+
df.select([count(when(isnan(c) | col(c).isNull(), c)).alias(c) for c in df.columns]).show()
# +-------+----------+---+
# |session|timestamp1|id2|
# +-------+----------+---+
# |      0|         0|  5|
# +-------+----------+---+
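Note that isnan() only makes sense for floating-point columns. If you care only about nulls, whatever the column type, drop the isnan check; a minimal variant using the same df:
# Nulls only, ignoring NaN -- id2 has two literal Nones in the data above.
df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns]).show()
# +-------+----------+---+
# |session|timestamp1|id2|
# +-------+----------+---+
# |      0|         0|  2|
# +-------+----------+---+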
df.describe().show()
# +-------+-------+------------------+---+
# |summary|session|        timestamp1|id2|
# +-------+-------+------------------+---+
# |  count|      7|                 7|  5|
# |   mean|    1.0| 3.857142857142857|NaN|
# | stddev|    0.0|1.9518001458970662|NaN|
# |    min|      1|                 1|5.0|
# |    max|      1|                 6|NaN|
# +-------+-------+------------------+---+
There is no equivalent to pandas.DataFrame.info() that I know of. printSchema is useful, and toPandas().info() works for small dataframes, but when I use pandas.DataFrame.info() it is often the null values I am looking at.
Upvotes: 3
Reputation: 5565
I could not find a good answer, so I use the slightly cheating:
dataFrame.toPandas().info()
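Be aware that toPandas() collects the entire dataframe onto the driver, so for large data it may help to cap the rows first, with the caveat that info() then describes only that slice:
# Sketch: convert only the first N rows so the driver isn't overwhelmed;
# the dtypes and non-null counts then reflect the sample, not the full data.
dataFrame.limit(10000).toPandas().info()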
Upvotes: 4