cgp25
cgp25

Reputation: 345

Merging Two Rows (one with a value, the other NaN) in Pandas

I am aware similar questions have been asked before (How to merge two rows in a dataframe pandas, etc), but I am still struggling to do the following (except with pandas dataframe with many rows):

    team_token  day1    day2   day3  day4
0   abc          1      NaN     NaN   NaN
1   abc          NaN     1      NaN   NaN
2   abc          NaN     NaN    NaN    NaN
3   abc          NaN     NaN    NaN     1

I want to combine the rows with the same team_token so that the end result looks like:

    team_token  day1    day2  day3  day4
0   abc           1      1    NaN     1

Thank you in advance.

Upvotes: 6

Views: 10885

Answers (4)

Oren
Oren

Reputation: 1

Suppose you want to make sure you're not losing information using the max method, because some rows are not duplicated even when disregarding the trivial case where one row has an NaN and another doesn't.

import pandas as pd
import numpy as np

df = pd.DataFrame({'team_token': ['abc', 'abc', 'abc', 'abc'],
                   'day1': [1, None, None, None],
                   'day2': [None, 1, None, None],
                   'day3': [None, None, None, 2],
                   'day4': [None, None, None, 1]})

# Assuming all values are numeric, the following method checks for these cases:
check = df.groupby('team_token').max() == df.groupby('team_token').min()
if not check.all().all():
    raise ValueError('Using the groupby team_token with max loses information.')

Upvotes: 0

VinceP
VinceP

Reputation: 2163

Use combine_first on two Series given by the two rows of you dataframe:

import pandas as pd
df = pd.DataFrame({'team_token':['abc', 'abc'], 'day1': [1, None], 'day2' : [None, 1]})

df.loc[0].combine_first(df.loc[1])

gives:

 team_token  day1    day2
0   abc           1       1 

#EDIT: A better solution that works also with your updated answer is simply:

df.max()

assuming that your day columns contain either Nan or ones.

In case you have different team_tokens:

   day1  day2  day3 team_token
0   1.0   NaN   NaN        abc
1   NaN   NaN   NaN        abc
2   1.0   1.0   NaN        abc
3   NaN   NaN   NaN        abc
4   1.0   NaN   1.0        ddd
5   1.0   NaN   1.0        ddd
6   NaN   NaN   1.0        ddd

you can group_by and still take the max of the group:

df.groupby('team_token').max().reset_index()

#  team_token  day1  day2  day3
#0        abc   1.0   1.0   NaN
#1        ddd   1.0   NaN   1.0

Upvotes: 9

Kirill Dolmatov
Kirill Dolmatov

Reputation: 327

Maybe, not the shortest solution, but imho more flexible.

import pandas as pd
import numpy as np

df = pd.DataFrame({'team_token':['abc', 'abc', 'abc', 'abc'], 
                   'day1': [1, None, None, None], 
                   'day2': [None, 1, None, None],
                   'day3': [None, None, None, None],
                   'day4': [None, None, None, 1]})
df

enter image description here

def replace_nan_func(x):
    x = x[~pd.isna(x)]
    if len(x) > 0:
        return x.iloc[0]
    else:
        return np.NaN

df_valid = df.groupby(by='team_token').agg(dict.fromkeys(df.columns[1:], replace_nan_func))
df_valid

enter image description here

Upvotes: 0

Vaishali
Vaishali

Reputation: 38415

If there is only one non-NaN value for each team_token, you can use groupby/sum

df.groupby('team_token')['day1', 'day2'].sum().reset_index()

You get

    team_token  day1    day2
0   abc         1.0     1.0

Upvotes: 0

Related Questions