Reputation: 619
I am quite new to pandas
. I need to aggregate 'Names'
if they have the same name and then make an average for 'Rating'
and 'NumsHelpful'
(without counting NaN
). 'Review'
should get concatenated whilst 'Weight(Pounds)'
should remain untouched:
col names: ['Brand', 'Name', 'NumsHelpful', 'Rating', 'Weight(Pounds)', 'Review']
Name 'Brand' 'Name'
1534 Zing Zang Zing Zang Bloody Mary Mix, 32 fl oz
1535 Zing Zang Zing Zang Bloody Mary Mix, 32 fl oz
1536 Zing Zang Zing Zang Bloody Mary Mix, 32 fl oz
1537 Zing Zang Zing Zang Bloody Mary Mix, 32 fl oz
1538 Zing Zang Zing Zang Bloody Mary Mix, 32 fl oz
1539 Zing Zang Zing Zang Bloody Mary Mix, 32 fl oz
1540 Zing Zang Zing Zang Bloody Mary Mix, 32 fl oz
'NumsHelpful' 'Rating' 'Weight'
1534 NaN 2 4.5
1535 NaN 2 4.5
1536 NaN NaN 4.5
1537 NaN NaN 4.5
1538 2 NaN 4.5
1539 3 5 4.5
1540 5 NaN 4.5
'Review'
1534 Yummy - Delish
1535 The best Bloody Mary mix! - The best Bloody Ma...
1536 Best Taste by far - I've tried several if not ...
1537 Best bloody mary mix ever - This is also good ...
1538 Outstanding - Has a small kick to it but very ...
1539 OMG! So Good! - Spicy, terrific Bloody Mary mix!
1540 Good stuff - This is the best
So the output should be something like this:
'Brand' 'Name' 'NumsHelpful' 'Rating'
Zing Zang Zing Zang Bloody Mary Mix, 32 fl oz 3.33 3
'Weight' 'Review'
4.5 Review1 / Review2 / ... / ReviewN
How shall I procede? Thanks.
Upvotes: 6
Views: 22023
Reputation: 11
I've seen this happen because when creating the index you chose to keep the column in the list, usually the column goes to the index is excluded from the table, so do the following:
# dataset_A was created with the option # drop = False
df_dataset_new = dataset_A.copy()
index_df = ['month', 'scop']
# dataset_new will be create`enter code here`d with the option # drop = True
df_dataset_new.set_index(index_df, drop=True, inplace=True, verify_integrity=True)
Upvotes: 0
Reputation: 862511
Use DataFrameGroupBy.agg
with dictionary of columns and aggregated functions - columns Weight
and Brand
are agregated by first
- it means first values per groups:
d = {'NumsHelpful':'mean',
'Review':'/'.join,
'Weight':'first',
'Brand':'first',
'Rating':'mean'}
df = df.groupby('Name').agg(d).reset_index()
print (df)
Name NumsHelpful \
0 Zing Zang Bloody Mary Mix, 32 fl oz 3.333333
Review Weight Brand \
0 Yummy - Delish/The best Bloody Mary mix! - The... 4.5 Zing Zang
Rating
0 3.0
Also in pandas 0.23.1 pandas version get:
FutureWarning: 'Name' is both an index level and a column label. Defaulting to column, but this will raise an ambiguity error in a future version
Solution is remove index name Name
:
df.index.name = None
Or:
df = df.rename_axis(None)
Another possible solution is not aggregate by first
, but add these column to groupby
:
d = {'NumsHelpful':'mean', 'Review':'/'.join, 'Rating':'mean'}
df = df.groupby(['Name', 'Weight','Brand']).agg(d).reset_index()
Both solutions return same output if per groups there are same values.
EDIT:
If need convert string (object) column to numeric first try convert by astype
:
df['Weight(Pounds)'] = df['Weight(Pounds)'].astype(float)
And if it failed use to_numeric
with parameter errors='coerce'
for convert non parseable strings to NaN
s:
df['Weight(Pounds)'] = pd.to_numeric(df['Weight(Pounds)'], errors='coerce')
Upvotes: 10
Reputation: 164623
You can aggregate with a different function for each column using groupby
+ agg
, together with a dictionary mapping series to functions. For example:
d = {'Rating': 'mean',
'NumsHelpful': 'mean',
'Review': ' | '.join,
'Weight(Pounds)': 'first'}
res = df.groupby('Name').agg(d)
Upvotes: 2