Stefano Pozzi
Stefano Pozzi

Reputation: 619

Aggregating Rows Pandas

I am quite new to pandas. I need to aggregate 'Names' if they have the same name and then make an average for 'Rating' and 'NumsHelpful' (without counting NaN). 'Review' should get concatenated whilst 'Weight(Pounds)'should remain untouched:

col names: ['Brand', 'Name', 'NumsHelpful', 'Rating', 'Weight(Pounds)', 'Review']

Name             'Brand'                             'Name'
1534             Zing Zang                Zing Zang Bloody Mary Mix, 32 fl oz   
1535             Zing Zang                Zing Zang Bloody Mary Mix, 32 fl oz   
1536             Zing Zang                Zing Zang Bloody Mary Mix, 32 fl oz   
1537             Zing Zang                Zing Zang Bloody Mary Mix, 32 fl oz   
1538             Zing Zang                Zing Zang Bloody Mary Mix, 32 fl oz   
1539             Zing Zang                Zing Zang Bloody Mary Mix, 32 fl oz   
1540             Zing Zang                Zing Zang Bloody Mary Mix, 32 fl oz   

        'NumsHelpful'     'Rating'       'Weight'
1534          NaN            2              4.5   
1535          NaN            2              4.5   
1536          NaN            NaN            4.5   
1537          NaN            NaN            4.5   
1538          2              NaN            4.5   
1539          3              5              4.5   
1540          5              NaN            4.5   

                        'Review'
1534                                     Yummy - Delish  
1535  The best Bloody Mary mix! - The best Bloody Ma...  
1536  Best Taste by far - I've tried several if not ...  
1537  Best bloody mary mix ever - This is also good ...  
1538  Outstanding - Has a small kick to it but very ...  
1539   OMG! So Good! - Spicy, terrific Bloody Mary mix!  
1540                      Good stuff - This is the best  

So the output should be something like this:

 'Brand'                'Name'                   'NumsHelpful'    'Rating' 
Zing Zang    Zing Zang Bloody Mary Mix, 32 fl oz     3.33             3

 'Weight'               'Review'
   4.5      Review1 / Review2 / ... / ReviewN

How shall I procede? Thanks.

Upvotes: 6

Views: 22023

Answers (3)

Carmo Melo
Carmo Melo

Reputation: 11

I've seen this happen because when creating the index you chose to keep the column in the list, usually the column goes to the index is excluded from the table, so do the following:

# dataset_A was created with the option # drop = False
df_dataset_new = dataset_A.copy()
index_df = ['month', 'scop']
# dataset_new will be create`enter code here`d with the option # drop = True
df_dataset_new.set_index(index_df, drop=True, inplace=True, verify_integrity=True) 

Upvotes: 0

jezrael
jezrael

Reputation: 862511

Use DataFrameGroupBy.agg with dictionary of columns and aggregated functions - columns Weight and Brand are agregated by first - it means first values per groups:

d = {'NumsHelpful':'mean', 
     'Review':'/'.join, 
     'Weight':'first',
     'Brand':'first', 
     'Rating':'mean'}
df = df.groupby('Name').agg(d).reset_index()
print (df)
                                  Name  NumsHelpful  \
0  Zing Zang Bloody Mary Mix, 32 fl oz     3.333333   

                                              Review  Weight      Brand  \
0  Yummy - Delish/The best Bloody Mary mix! - The...     4.5  Zing Zang   

   Rating  
0     3.0  

Also in pandas 0.23.1 pandas version get:

FutureWarning: 'Name' is both an index level and a column label. Defaulting to column, but this will raise an ambiguity error in a future version

Solution is remove index name Name:

df.index.name = None

Or:

df = df.rename_axis(None)

Another possible solution is not aggregate by first, but add these column to groupby:

d = {'NumsHelpful':'mean',  'Review':'/'.join, 'Rating':'mean'}
df = df.groupby(['Name', 'Weight','Brand']).agg(d).reset_index()

Both solutions return same output if per groups there are same values.

EDIT:

If need convert string (object) column to numeric first try convert by astype:

df['Weight(Pounds)'] = df['Weight(Pounds)'].astype(float)

And if it failed use to_numeric with parameter errors='coerce' for convert non parseable strings to NaNs:

df['Weight(Pounds)'] = pd.to_numeric(df['Weight(Pounds)'], errors='coerce')

Upvotes: 10

jpp
jpp

Reputation: 164623

You can aggregate with a different function for each column using groupby + agg, together with a dictionary mapping series to functions. For example:

d = {'Rating': 'mean',
     'NumsHelpful': 'mean',
     'Review': ' | '.join,
     'Weight(Pounds)': 'first'}

res = df.groupby('Name').agg(d)

Upvotes: 2

Related Questions