Reputation: 771
I have a dataframe that looks like this:
import numpy as np
import pandas as pd

data = [
["AB", np.nan, 134, "ID1"],
["AB", np.nan, 252, "ID1"],
["BC", np.nan, 56, "ID2"],
["CD", np.nan, 159, "ID1"],
["DE", 32, np.nan, "ID3"],
]
df = pd.DataFrame(data, columns=["method", "var_1", "var_2", "ID"])
df
I am trying to get a count of unique IDs, grouped by method, which have missing values for var_1 and var_2, and am unable to find a way to do this.
I have been able to get counts of missing data by using count() and size() and subtracting one from the other, but unfortunately I really need counts of unique IDs. It seems so simple that I feel I must be missing something obvious!
My desired output is below, where we group by method and count the number of unique IDs missing information in the other columns:
method var_1 var_2
AB 1 0
BC 1 0
CD 1 0
DE 0 1
Upvotes: 3
Views: 479
Reputation: 59579
To count unique IDs, check where the values are null, then take the max
within each [ID, method] group to flag whether that ID is missing anything. Then sum over method to get the number of unique IDs missing each variable.
(df[['var_1', 'var_2']].isnull()
   .groupby([df['ID'], df['method']]).max()
   .groupby(level='method').sum())

        var_1  var_2
method
AB          1      0
BC          1      0
CD          1      0
DE          0      1
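As a self-contained sketch of the same idea (the frame is rebuilt from the question; note that `sum(level=...)` was removed in pandas 2.0, so the per-method sum is spelled as a second groupby):

```python
import numpy as np
import pandas as pd

# Data copied from the question.
data = [
    ["AB", np.nan, 134, "ID1"],
    ["AB", np.nan, 252, "ID1"],
    ["BC", np.nan, 56, "ID2"],
    ["CD", np.nan, 159, "ID1"],
    ["DE", 32, np.nan, "ID3"],
]
df = pd.DataFrame(data, columns=["method", "var_1", "var_2", "ID"])

# Flag, per (ID, method), whether each variable is ever missing,
# then count the flagged IDs per method.
flags = df[["var_1", "var_2"]].isna().groupby([df["ID"], df["method"]]).max()
counts = flags.groupby(level="method").sum()
print(counts)
```

The max collapses the two AB rows for ID1 into a single flag, which is why AB counts 1 rather than 2.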
Upvotes: 2
Reputation: 323356
In your case
g = df.groupby(['ID','method'])
out = g[['var_1','var_2']].apply(lambda x : x.isnull().sum())
           var_1  var_2
ID  method
ID1 AB         2      0
    CD         1      0
ID2 BC         1      0
ID3 DE         0      1
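Note this counts missing rows per (ID, method) rather than unique IDs, which is why the two AB rows for ID1 give 2. A hedged sketch of one way to reduce it to the per-method unique-ID counts the question asks for (capping each group at 1 before summing):

```python
import numpy as np
import pandas as pd

# Data copied from the question.
data = [
    ["AB", np.nan, 134, "ID1"],
    ["AB", np.nan, 252, "ID1"],
    ["BC", np.nan, 56, "ID2"],
    ["CD", np.nan, 159, "ID1"],
    ["DE", 32, np.nan, "ID3"],
]
df = pd.DataFrame(data, columns=["method", "var_1", "var_2", "ID"])

# Count missing rows per (ID, method), as in the answer above.
g = df.groupby(["ID", "method"])
out = g[["var_1", "var_2"]].apply(lambda x: x.isnull().sum())

# An ID should count once per method no matter how many of its rows
# are missing, so turn the counts into 0/1 flags and sum over method.
unique_counts = out.gt(0).groupby(level="method").sum()
print(unique_counts)
```

`out.gt(0)` converts the row counts to booleans, so the final sum counts distinct IDs per method.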
Upvotes: 4