Reputation: 73
I have many columns with data (data_cols) and with text (text_cols), I'd like to do the following by referring to a list of column names, but cannot figure it out:
df.groupby('id', as_index=False).agg({data_cols: 'sum', text_cols: 'first'})
I have a dataframe with ~30 columns, some of the columns contain values and the others contain text. I would like to use sum all values with the same id and for the text to use the first entry. I can achieve this by using groupby:
d = {'id': ['a', 'a', 'b', 'c'], 'value1': [1, 2, 3, 4], 'value2': [5, 6, 7, 8], 'text1': ['w', 'x', 'y', 'z']}
df = pd.DataFrame(d)
id value1 value2 text1
0 a 1 5 w
1 a 2 6 x
2 b 3 7 y
3 c 4 8 z
df.groupby('id', as_index=False).agg({'value1': 'sum', 'value2': 'sum', 'text1': 'first'})
id value1 value2 text1
0 a 3 11 w
1 b 3 7 y
2 c 4 8 z
This is exactly what I would like to achieve, except that I have many columns and would prefer not to write out all column names. So I have tried different things to reference multiple columns, but I don't get any to work.
data_cols = df.columns[1:3]
text_cols = set(df.columns) - set(data_cols)
df.groupby('id', as_index=False).agg({data_cols: 'sum', text_cols: 'first'})
Here, I get TypeError: unhashable type: 'Index'
, so I thought I could avoid this by using tuple
:
data_cols = tuple(df.columns[1:3])
text_cols = tuple(set(df.columns) - set(data_cols)) #I have many data columns and some text columbs before and after them
Which gives me: SpecificationError: Column(s) [('text1', 'id'), ('value1', 'value2')] do not exist
I think it views ('value1', 'value2')
as a single column, instead of unpacking it into the two columns. Is there any way or format to avoid this?
Alternatively I can split the df into 2 dfs: df_values and df_text, do the grouping on df_values and then join them back together, but this seems cumbersome and I imagine there to be a better way.
Upvotes: 5
Views: 5689
Reputation: 602
If you have the columns as a list. I would highly recommend using the following method -
df.groupby('col', as_index=False)[columns_list].agg('sum')
P.S - The as_index=False is an important parameter
Worked like a charm for me!
Upvotes: 0
Reputation: 863256
Create dictionaries by dict.fromkeys
and merge them, last pass to agg
:
data_cols = df.columns[1:3]
text_cols = set(df.columns) - set(data_cols)
d1 = dict.fromkeys(data_cols, 'sum')
d2 = dict.fromkeys(text_cols, 'first')
#https://stackoverflow.com/questions/38987
d = {**d1, **d2}
df.groupby('id', as_index=False).agg(d)
Upvotes: 5
Reputation: 19957
You need to first convert the column names to a dict with desired operations and then merge the dict.
(
df.groupby('id', as_index=False)
.agg({**{e:'sum' for e in data_cols}, **{e:'first' for e in text_cols}})
)
Upvotes: 4