Reputation: 93
I often produce single-statement summary reports in R with dplyr, like the following:
library(dplyr)

a <- group_by(data, x)
b <- summarise(a,
  # count distinct y where value is not missing
  y_distinct = n_distinct(y[!is.na(y)]),
  # count distinct z where value is not missing
  z_distinct = n_distinct(z[!is.na(z)]),
  # count total number of values
  total = n(),
  # count y where value is not missing
  y_not_missing = length(y[!is.na(y)]),
  # count y where value is missing
  y_missing = length(y[is.na(y)]))
which is similar to the way I would produce it in SQL:
select
count(distinct(case when y is not null then y end)) as y_distinct,
count(distinct(case when z is not null then z end)) as z_distinct,
count(1) as total,
count(case when y is not null then 1 end) as y_not_missing,
count(case when y is null then 1 end) as y_missing
from data group by x
However, I'm a Python newbie and haven't been able to find the pandas equivalent; I got lost in the documentation. I can produce each aggregation with a separate groupby -> agg statement, but I need help producing the whole report in a single data frame (preferably in a single statement).
Upvotes: 1
Views: 521
Reputation: 863166
I believe you need aggregation with agg, using the functions size to count all values, count to count non-NaN values, and nunique to count unique values, plus a custom function to count NaNs:
import numpy as np
import pandas as pd

df = pd.DataFrame({'y':[4,np.nan,4,5,5,4],
                   'z':[np.nan,8,9,4,2,3],
                   'x':list('aaaabb')})
print (df)
x y z
0 a 4.0 NaN
1 a NaN 8.0
2 a 4.0 9.0
3 a 5.0 4.0
4 b 5.0 2.0
5 b 4.0 3.0
# custom function counting NaNs per group
f = lambda x: x.isnull().sum()
f.__name__ = 'nulls'
df = df.groupby('x').agg(['nunique', f, 'count', 'size'])
df.columns = df.columns.map('_'.join)
print (df)
   y_nunique  y_nulls  y_count  y_size  z_nunique  z_nulls  z_count  z_size
x
a          2      1.0        3       4          3      1.0        3       4
b          2      0.0        2       2          2      0.0        2       2
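If your pandas version is 0.25 or newer, named aggregation can also build the report with the question's exact column names in a single statement. A minimal sketch (rebuilding the sample frame, since df above was overwritten by the groupby result):
import numpy as np
import pandas as pd

# Same sample data as above.
df = pd.DataFrame({'y': [4, np.nan, 4, 5, 5, 4],
                   'z': [np.nan, 8, 9, 4, 2, 3],
                   'x': list('aaaabb')})

# Named aggregation (pandas >= 0.25): each keyword names an output column.
report = df.groupby('x').agg(
    y_distinct=('y', 'nunique'),                # distinct, NaN excluded
    z_distinct=('z', 'nunique'),
    total=('y', 'size'),                        # all rows, NaN included
    y_not_missing=('y', 'count'),               # non-NaN values only
    y_missing=('y', lambda s: s.isna().sum()),  # NaN values
)
print(report)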
Upvotes: 2
Reputation: 918
If you already have the SQL query, you can use the pandasql module to apply an SQL query directly to a pandas.DataFrame:
import pandasql as ps
query = """select
count(distinct(case when y is not null then y end)) as y_distinct,
count(distinct(case when z is not null then z end)) as z_distinct,
count(1) as total,
count(case when y is not null then 1 end) as y_not_missing,
count(case when y is null then 1 end) as y_missing
from df group by x""" #df here is the name of the DataFrame
ps.sqldf(query, locals())
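A self-contained usage sketch, with a small sample frame assumed here purely for illustration:
import numpy as np
import pandas as pd
import pandasql as ps

# Hypothetical sample data just to exercise the query.
df = pd.DataFrame({'x': list('aaaabb'),
                   'y': [4, np.nan, 4, 5, 5, 4],
                   'z': [np.nan, 8, 9, 4, 2, 3]})

query = """select x,
                  count(distinct(case when y is not null then y end)) as y_distinct,
                  count(case when y is null then 1 end) as y_missing
           from df group by x"""

# pandasql runs the query against an in-memory SQLite copy of df;
# NaN values are stored as NULL, so the SQL null checks behave as expected.
print(ps.sqldf(query, locals()))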
Upvotes: 1
Reputation: 210882
Try something like this:
In [18]: df
Out[18]:
x y z
0 1 2.0 NaN
1 1 3.0 NaN
2 2 NaN 1.0
3 2 NaN 2.0
4 3 4.0 5.0
In [19]: def nulls(s):
...: return s.isnull().sum()
...:
In [23]: r = df.groupby('x').agg(['nunique','size',nulls])
In [24]: r
Out[24]:
y z
nunique size nulls nunique size nulls
x
1 2 2 0.0 0 2 2.0
2 0 2 2.0 2 2 0.0
3 1 1 0.0 1 1 0.0
In order to flatten columns:
In [25]: r.columns = r.columns.map('_'.join)
In [26]: r
Out[26]:
y_nunique y_size y_nulls z_nunique z_size z_nulls
x
1 2 2 0.0 0 2 2.0
2 0 2 2.0 2 2 0.0
3 1 1 0.0 1 1 0.0
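If you want the question's report names rather than the generated ones, a small follow-up sketch (the rename mapping here is an assumption based on the question's aliases):
# Map the generated column names onto the question's report names
# and turn the group key x back into a regular column.
r = r.rename(columns={'y_nunique': 'y_distinct',
                      'z_nunique': 'z_distinct',
                      'y_size': 'total',
                      'y_nulls': 'y_missing'})
r = r.reset_index()
print(r)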
Upvotes: 4