Reputation: 345
I am aware that similar questions have been asked before (How to merge two rows in a dataframe pandas, etc.), but I am still struggling to do the following, except with a pandas dataframe that has many rows:
  team_token day1 day2 day3 day4
0        abc    1  NaN  NaN  NaN
1        abc  NaN    1  NaN  NaN
2        abc  NaN  NaN  NaN  NaN
3        abc  NaN  NaN  NaN    1
I want to combine the rows with the same team_token so that the end result looks like:
  team_token day1 day2 day3 day4
0        abc    1    1  NaN    1
Thank you in advance.
Upvotes: 6
Views: 10885
Reputation: 1
Suppose you want to make sure you're not losing information when using the max method, because some rows may hold conflicting non-NaN values, even after disregarding the trivial case where one row has a NaN and another doesn't.
import pandas as pd

df = pd.DataFrame({'team_token': ['abc', 'abc', 'abc', 'abc'],
                   'day1': [1, None, None, None],
                   'day2': [None, 1, None, None],
                   'day3': [None, None, None, 2],
                   'day4': [None, None, None, 1]})

# Assuming all values are numeric, max == min for every column of every group
# exactly when each group holds at most one distinct non-NaN value.
# Note: a column that is all-NaN within a group compares NaN == NaN -> False,
# so this check also flags that (harmless) case.
check = df.groupby('team_token').max() == df.groupby('team_token').min()
if not check.all().all():
    raise ValueError('Using groupby on team_token with max loses information.')
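To see the check in action, here is a hypothetical frame where max() really would lose information: day1 holds two different non-NaN values (1 and 2) for the same team_token, so max and min disagree.

```python
import pandas as pd

# Hypothetical conflicting data: two distinct non-NaN values in one group.
df_conflict = pd.DataFrame({'team_token': ['abc', 'abc'],
                            'day1': [1, 2]})
check = df_conflict.groupby('team_token').max() == df_conflict.groupby('team_token').min()
print(bool(check.all().all()))  # the guard above would raise here
```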
Upvotes: 0
Reputation: 2163
Use combine_first on two Series given by the two rows of your dataframe:
import pandas as pd
df = pd.DataFrame({'team_token':['abc', 'abc'], 'day1': [1, None], 'day2' : [None, 1]})
df.loc[0].combine_first(df.loc[1])
gives:
  team_token day1 day2
0        abc    1    1
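The two-row call generalizes to any number of rows by folding combine_first over them with functools.reduce; a sketch with a hypothetical three-row frame:

```python
from functools import reduce
import pandas as pd

df = pd.DataFrame({'team_token': ['abc', 'abc', 'abc'],
                   'day1': [1, None, None],
                   'day2': [None, 1, None],
                   'day3': [None, None, 1]})

# Fold combine_first over all rows: each step fills the NaNs of the
# accumulated Series with values from the next row.
combined = reduce(lambda a, b: a.combine_first(b),
                  (row for _, row in df.iterrows()))
print(combined)
```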
EDIT: A better solution, which also works with your updated question, is simply:
df.max()
assuming that your day columns contain either NaN or ones. In case you have different team_tokens:
   day1  day2  day3 team_token
0   1.0   NaN   NaN        abc
1   NaN   NaN   NaN        abc
2   1.0   1.0   NaN        abc
3   NaN   NaN   NaN        abc
4   1.0   NaN   1.0        ddd
5   1.0   NaN   1.0        ddd
6   NaN   NaN   1.0        ddd
you can groupby and still take the max of each group:
df.groupby('team_token').max().reset_index()
# team_token day1 day2 day3
#0 abc 1.0 1.0 NaN
#1 ddd 1.0 NaN 1.0
Upvotes: 9
Reputation: 327
Maybe not the shortest solution, but IMHO more flexible.
import pandas as pd
import numpy as np

df = pd.DataFrame({'team_token': ['abc', 'abc', 'abc', 'abc'],
                   'day1': [1, None, None, None],
                   'day2': [None, 1, None, None],
                   'day3': [None, None, None, None],
                   'day4': [None, None, None, 1]})

def replace_nan_func(x):
    """Return the first non-NaN value of the group, or NaN if there is none."""
    x = x[~pd.isna(x)]
    if len(x) > 0:
        return x.iloc[0]
    else:
        return np.nan

df_valid = df.groupby(by='team_token').agg(dict.fromkeys(df.columns[1:], replace_nan_func))
df_valid
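For the common case of "first non-NaN value per column per group", pandas' built-in groupby first() does the same thing (it skips NA elements by default); a sketch with the same data:

```python
import pandas as pd

df = pd.DataFrame({'team_token': ['abc', 'abc', 'abc', 'abc'],
                   'day1': [1, None, None, None],
                   'day2': [None, 1, None, None],
                   'day3': [None, None, None, None],
                   'day4': [None, None, None, 1]})

# first() returns the first non-NaN value per column within each group,
# which matches what replace_nan_func computes.
df_first = df.groupby('team_token').first().reset_index()
print(df_first)
```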
Upvotes: 0
Reputation: 38415
If there is only one non-NaN value for each team_token, you can use groupby/sum:

df.groupby('team_token')[['day1', 'day2']].sum().reset_index()
You get
  team_token  day1  day2
0        abc   1.0   1.0
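One caveat with the sum approach: a group whose column is entirely NaN sums to 0, not NaN. Passing min_count=1 preserves the NaN; a sketch with a hypothetical all-NaN day2:

```python
import pandas as pd

df = pd.DataFrame({'team_token': ['abc', 'abc'],
                   'day1': [1, None],
                   'day2': [None, None]})

# Plain sum() turns the all-NaN day2 group into 0; min_count=1 keeps it
# NaN, preserving the "no value recorded" meaning.
out = df.groupby('team_token')[['day1', 'day2']].sum(min_count=1).reset_index()
print(out)
```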
Upvotes: 0