kamashay

Reputation: 93

SQL-like summarize reports using Python pandas

In R, I often produce single-statement summarize reports using dplyr, like the following:

a <- group_by(data, x)
b <- summarise(a,
               # count distinct y where value is not missing
               y_distinct = n_distinct(y[!is.na(y)]),
               # count distinct z where value is not missing
               z_distinct = n_distinct(z[!is.na(z)]),
               # count total number of values
               total = n(),
               # count y where value is not missing
               y_not_missing = sum(!is.na(y)),
               # count y where value is missing
               y_missing = sum(is.na(y)))

which is similar to the way I would produce it in SQL:

select
    count(distinct(case when y is not null then y end)) as y_distinct,
    count(distinct(case when z is not null then z end)) as z_distinct,
    count(1) as total,
    count(case when y is not null then 1 end) as y_not_missing,
    count(case when y is null then 1 end) as y_missing
from data group by x

However, I'm a Python newbie and haven't been able to find the pandas equivalent; I got lost in the documentation. I can produce each aggregation using a separate groupby -> agg statement, but I need help producing the report in a single data frame (preferably using a single statement).
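A minimal sketch of what such a single statement could look like with named aggregation (this assumes pandas 0.25 or newer; the frame and column names mirror the examples above):

import pandas as pd
import numpy as np

# toy data with the same columns as the dplyr/SQL examples
data = pd.DataFrame({'x': list('aabb'),
                     'y': [1, np.nan, 2, 2],
                     'z': [np.nan, 3, 4, 4]})

# one groupby -> agg statement mirroring the SQL aliases;
# 'nunique' and 'count' skip NaN, while 'size' counts all rows
report = data.groupby('x').agg(
    y_distinct=('y', 'nunique'),
    z_distinct=('z', 'nunique'),
    total=('y', 'size'),  # 'size' counts rows incl. NaN, so any column works
    y_not_missing=('y', 'count'),
    y_missing=('y', lambda s: s.isna().sum()),
)
print(report)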

Upvotes: 1

Views: 521

Answers (3)

jezrael

Reputation: 863166

I believe you need aggregation by agg with multiple functions: size to count all values, count to count non-NaN values, nunique to count unique values, and a custom function to count NaNs:

import pandas as pd
import numpy as np

df = pd.DataFrame({'y':[4,np.nan,4,5,5,4],
                   'z':[np.nan,8,9,4,2,3],
                   'x':list('aaaabb')})

print (df)
   x    y    z
0  a  4.0  NaN
1  a  NaN  8.0
2  a  4.0  9.0
3  a  5.0  4.0
4  b  5.0  2.0
5  b  4.0  3.0



# custom aggregation that counts missing (NaN) values
f = lambda x: x.isnull().sum()
f.__name__ = 'nulls'
df = df.groupby('x').agg(['nunique', f, 'count', 'size'])
df.columns = df.columns.map('_'.join)
print (df)
   y_nunique  y_nulls  y_count  y_size  z_nunique  z_nulls  z_count  z_size
x
a          2      1.0        3       4          3      1.0        3       4
b          2      0.0        2       2          2      0.0        2       2
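Note that the null counts come back as floats in the output above; if integer counts are preferred, a small post-processing cast (sketched here against the column names produced above) works:

# cast the float null-count columns back to integers
null_cols = [c for c in df.columns if c.endswith('_nulls')]
df[null_cols] = df[null_cols].astype(int)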

Upvotes: 2

O.Suleiman

Reputation: 918

If you already have the SQL query, you can use the pandasql module to run the query directly against a pandas.DataFrame:

import pandasql as ps
query = """select
    count(distinct(case when y is not null then y end)) as y_distinct,
    count(distinct(case when z is not null then z end)) as z_distinct,
    count(1) as total,
    count(case when y is not null then 1 end) as y_not_missing,
    count(case when y is null then 1 end) as y_missing
from df group by x"""  # df here is the name of the DataFrame
ps.sqldf(query, locals())
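As a usage note, pandasql is a third-party package (pip install pandasql) that runs the query through an in-memory SQLite database, so only SQLite-flavored SQL is supported. A small helper, a convention from the pandasql README (assumed here, reusing the query and df defined above), avoids passing the environment on every call:

# helper so the environment doesn't have to be passed explicitly each time;
# assumes df and query live at module level, where globals() can see them
def pysqldf(q):
    return ps.sqldf(q, globals())

result = pysqldf(query)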

Upvotes: 1

MaxU - stand with Ukraine

Reputation: 210882

Try something like this:

In [18]: df
Out[18]:
   x    y    z
0  1  2.0  NaN
1  1  3.0  NaN
2  2  NaN  1.0
3  2  NaN  2.0
4  3  4.0  5.0

In [19]: def nulls(s):
    ...:     return s.isnull().sum()
    ...:

In [23]: r = df.groupby('x').agg(['nunique','size',nulls])

In [24]: r
Out[24]:
        y                  z
  nunique size nulls nunique size nulls
x
1       2    2   0.0       0    2   2.0
2       0    2   2.0       2    2   0.0
3       1    1   0.0       1    1   0.0

To flatten the columns:

In [25]: r.columns = r.columns.map('_'.join)

In [26]: r
Out[26]:
   y_nunique  y_size  y_nulls  z_nunique  z_size  z_nulls
x
1          2       2      0.0          0       2      2.0
2          0       2      2.0          2       2      0.0
3          1       1      0.0          1       1      0.0
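To also reproduce the question's non-missing counts (y_not_missing), the built-in 'count' aggregation, which skips NaN values, could be added to the list; a sketch:

# 'count' ignores NaN, so it gives the per-group non-missing counts
r = df.groupby('x').agg(['nunique', 'size', 'count', nulls])
r.columns = r.columns.map('_'.join)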

Upvotes: 4
