Reputation: 20175
Given the following data frame
import pandas as pd

df = pd.DataFrame({
    'identifier': ['1', '2', None],
    'name': ['Tom', 'Peter', 'Peter'],
    'registered': [True, False, True]
})
the ultimate goal is to merge the rows of the data frame, grouped by name, according to certain rules, e.g.:
- if one identifier is a string and the other is None, then use the string identifier
- apply or to all registered entries
So the result should look like
df_result = pd.DataFrame({
    'identifier': ['1', '2'],
    'name': ['Tom', 'Peter'],
    'registered': [True, True]
})
I tried it with groupby, but maybe this is the wrong approach altogether? drop_duplicates does not let me add specific rules.
Upvotes: 3
Views: 63
Reputation: 863291
I think you need a custom function with dropna, drop_duplicates and any:
df = pd.DataFrame({
    'identifier': ['1', '2', None, '2'],
    'name': ['Peter', 'Peter', 'Peter', 'Peter'],
    'registered': [True, False, True, True]
})
print (df)
  identifier   name  registered
0          1  Peter        True
1          2  Peter       False
2       None  Peter        True
3          2  Peter        True
def f(x):
    # one row per distinct non-null identifier; registered is True if any row in the group is True
    return pd.DataFrame({'identifier': x['identifier'].dropna().drop_duplicates(),
                         'registered': x['registered'].any()})

df = df.groupby('name').apply(f).reset_index(level=1, drop=True).reset_index()
print (df)
    name identifier  registered
0  Peter          1        True
1  Peter          2        True
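As a quick check, the same approach can be run against the frame from the question. This is a minimal sketch, not the answerer's exact code: the names merge_group and result are illustrative, and selecting the two columns before apply and passing sort=False are my additions, so that the helper never depends on the grouping column and the groups keep their original order.

```python
import pandas as pd

# Frame from the question.
df = pd.DataFrame({
    'identifier': ['1', '2', None],
    'name': ['Tom', 'Peter', 'Peter'],
    'registered': [True, False, True],
})

def merge_group(group):
    # Keep each distinct non-null identifier of the group and
    # broadcast the group-wide logical "or" of registered onto it.
    ids = group['identifier'].dropna().drop_duplicates()
    return pd.DataFrame({'identifier': ids,
                         'registered': group['registered'].any()})

result = (
    df.groupby('name', sort=False)[['identifier', 'registered']]
      .apply(merge_group)
      .reset_index(level=1, drop=True)  # drop the original row labels
      .reset_index()                    # turn the name index back into a column
)
print(result)
```

Note that a group whose identifiers are all None yields an empty frame here and therefore disappears from the result; whether that is acceptable depends on the data.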
Upvotes: 1
Reputation: 402872
Let's modify your data slightly.
df = pd.DataFrame({
    'identifier': ['1', None, '2'],
    'name': ['Tom', 'Peter', 'Peter'],
    'registered': [True, False, True]
})
df
  identifier   name  registered
0          1    Tom        True
1       None  Peter       False
2          2  Peter        True
Now None is the first identifier for "Peter". You can remedy this with a sort_values call, which places missing values last by default, and then call groupby + agg.
# 'any' is passed as a string alias; the builtin any gives the same result
df.sort_values(['identifier'])\
  .groupby('name', as_index=False)\
  .agg({'identifier' : 'first', 'registered' : 'any'})
    name  registered identifier
0  Peter        True          2
1    Tom        True          1
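The same idea can also be written with named aggregation (available since pandas 0.25), which states the output columns explicitly. This is a sketch under my own assumptions rather than the answerer's code: the variable name merged and the sort=False argument are illustrative choices, the latter keeping groups in order of first appearance after sorting.

```python
import pandas as pd

# Modified frame from this answer: None comes first for Peter.
df = pd.DataFrame({
    'identifier': ['1', None, '2'],
    'name': ['Tom', 'Peter', 'Peter'],
    'registered': [True, False, True],
})

merged = (
    df.sort_values('identifier')  # missing identifiers are sorted last by default
      .groupby('name', as_index=False, sort=False)
      .agg(
          identifier=('identifier', 'first'),  # first identifier after sorting
          registered=('registered', 'any'),    # logical "or" over the group
      )
)
print(merged)
```

With this input, merged holds one row per name (Tom with identifier '1', Peter with identifier '2'), both registered, which matches the df_result frame from the question up to column order.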
Upvotes: 1