Alejandro A
Alejandro A

Reputation: 1190

Group by to list multiple cols without nan (or any value)

Please note this is just an example, there are more columns in the example and the list ends up being very big, hence I don't want to iterate it twice

Having:

import pandas as pd
import numpy as np
data = pd.DataFrame({'Name':['Peter','Peter','Anna','Anna','Anna'],
                     'Country1':['Italy',np.nan,np.nan,'Sweden',np.nan],
                     'Country2':[np.nan,'Venezuela',np.nan,'Peru','Iceland'],
                     'Price':[12,33,45,6,9]})

I do

data_g_name = data.groupby('Name')
country_cols=['Country1','Country2']
g_stats = pd.DataFrame({
    'Countries':data_g_name['Country1','Country2'].apply(lambda x:x.values.flatten().tolist()),
    'TotalCost' : data_g_name['Price'].sum()
})

And obtain:

'    Name                               Countries  TotalCost
0   Anna  [nan, nan, Sweden, Peru, nan, Iceland]         60
1  Peter            [Italy, nan, nan, Venezuela]         45'

I would like (without having to iterate through the list if possible, real case list is big):

     Name           Countries        TotalCost
0    Anna            [Sweden,Peru,Iceland]          60
1    Peter           [Italy,Venezuela]           45

Upvotes: 0

Views: 105

Answers (2)

Corralien
Corralien

Reputation: 120391

Use melt to unpivot dataframe, drop all row with NaN in column 'Country', group by 'Name' and convert to list then join the sum of 'Price':

>>> df.melt(['Name', 'Price'], value_name='Country') \
      .dropna(subset=['Country']).groupby('Name')['Country'] \
      .apply(list).to_frame() \
      .join(df.groupby('Name')['Price'].sum().rename('TotalCost'))

                       Country  TotalCost
Name
Anna   [Sweden, Peru, Iceland]         60
Peter       [Italy, Venezuela]         45

Upvotes: 1

akuiper
akuiper

Reputation: 214917

Aggregate country and cost separately and then combine results:

cost = data.Price.groupby(data.Name).sum().rename('TotalCost')
countries = (
  data.melt('Name', ['Country1', 'Country2'], value_name='Countries')
   .dropna()
   .groupby('Name')
   .Countries
   .agg(list))

pd.concat([countries, cost], 1).reset_index()

#    Name                Countries  TotalCost
#0   Anna  [Sweden, Peru, Iceland]         60
#1  Peter       [Italy, Venezuela]         45

Upvotes: 1

Related Questions