Group by to list multiple cols without nan (or any value)

Question

Please note this is just an example, there are more columns in the example and the list ends up being very big, hence I don't want to iterate it twice

Having:

import pandas as pd
import numpy as np
data = pd.DataFrame({'Name':['Peter','Peter','Anna','Anna','Anna'],
                     'Country1':['Italy',np.nan,np.nan,'Sweden',np.nan],
                     'Country2':[np.nan,'Venezuela',np.nan,'Peru','Iceland'],
                     'Price':[12,33,45,6,9]})

I do

data_g_name = data.groupby('Name')
country_cols=['Country1','Country2']
g_stats = pd.DataFrame({
    'Countries':data_g_name['Country1','Country2'].apply(lambda x:x.values.flatten().tolist()),
    'TotalCost' : data_g_name['Price'].sum()
})

And obtain:

'    Name                               Countries  TotalCost
0   Anna  [nan, nan, Sweden, Peru, nan, Iceland]         60
1  Peter            [Italy, nan, nan, Venezuela]         45'

I would like (without having to iterate through the list if possible, real case list is big):

     Name           Countries        TotalCost
0    Anna            [Sweden,Peru,Iceland]          60
1    Peter           [Italy,Venezuela]           45

Corralien · Accepted Answer

Use melt to unpivot dataframe, drop all row with NaN in column 'Country', group by 'Name' and convert to list then join the sum of 'Price':

>>> df.melt(['Name', 'Price'], value_name='Country') \
      .dropna(subset=['Country']).groupby('Name')['Country'] \
      .apply(list).to_frame() \
      .join(df.groupby('Name')['Price'].sum().rename('TotalCost'))

                       Country  TotalCost
Name
Anna   [Sweden, Peru, Iceland]         60
Peter       [Italy, Venezuela]         45

Group by to list multiple cols without nan (or any value)

Answers (2)

Related Questions