Dianne

Reputation: 73

Use a list of column names in groupby agg?

Summary:

I have many columns with data (data_cols) and with text (text_cols). I'd like to do the following by referring to a list of column names, but cannot figure out how: df.groupby('id', as_index=False).agg({data_cols: 'sum', text_cols: 'first'})

Explanation:

I have a dataframe with ~30 columns; some of the columns contain values and the others contain text. I would like to sum all values with the same id and, for the text columns, use the first entry. I can achieve this with groupby:

import pandas as pd

d = {'id': ['a', 'a', 'b', 'c'], 'value1': [1, 2, 3, 4], 'value2': [5, 6, 7, 8], 'text1': ['w', 'x', 'y', 'z']}
df = pd.DataFrame(d)

  id  value1  value2 text1
0  a       1       5     w
1  a       2       6     x
2  b       3       7     y
3  c       4       8     z

df.groupby('id', as_index=False).agg({'value1': 'sum', 'value2': 'sum', 'text1': 'first'})

  id  value1  value2 text1
0  a       3      11     w
1  b       3       7     y
2  c       4       8     z

This is exactly what I would like to achieve, except that I have many columns and would prefer not to write out all the column names. I have tried different ways to reference multiple columns, but I can't get any of them to work.

data_cols = df.columns[1:3]
text_cols = set(df.columns) - set(data_cols)

df.groupby('id', as_index=False).agg({data_cols: 'sum', text_cols: 'first'})

Here I get TypeError: unhashable type: 'Index', so I thought I could avoid this by using a tuple:

data_cols = tuple(df.columns[1:3])
text_cols = tuple(set(df.columns) - set(data_cols))  # I have many data columns and some text columns before and after them

Which gives me SpecificationError: Column(s) [('text1', 'id'), ('value1', 'value2')] do not exist. I think it treats ('value1', 'value2') as a single column instead of unpacking it into the two columns. Is there any way or format to avoid this?

Alternatively, I could split the df into two dfs, df_values and df_text, do the grouping on df_values and then join them back together, but this seems cumbersome and I imagine there is a better way.
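
For reference, the split-and-merge version would look roughly like this (a sketch that reuses the data_cols and text_cols defined above, keeping id only as the join key):

# sum the value columns per id
df_values = df[['id'] + list(data_cols)].groupby('id', as_index=False).sum()
# take the first text entry per id
df_text = df[['id'] + [c for c in text_cols if c != 'id']].groupby('id', as_index=False).first()
# join the two parts back together on id
df_values.merge(df_text, on='id')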

Upvotes: 5

Views: 5689

Answers (3)

Mohseen Mulla

Reputation: 602

If you have the columns as a list, I would highly recommend using the following method:

df.groupby('col', as_index=False)[columns_list].agg('sum')

P.S. - as_index=False is an important parameter here.

Worked like a charm for me!
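
For instance, applied to the df from the question, this might look like the following sketch (it aggregates only the value columns, so the text columns are dropped):

data_cols = list(df.columns[1:3])  # ['value1', 'value2']
df.groupby('id', as_index=False)[data_cols].agg('sum')

  id  value1  value2
0  a       3      11
1  b       3       7
2  c       4       8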

Upvotes: 0

jezrael

Reputation: 863256

Create dictionaries with dict.fromkeys and merge them, then pass the result to agg:

data_cols = df.columns[1:3]
text_cols = set(df.columns) - set(data_cols)
d1 = dict.fromkeys(data_cols, 'sum')
d2 = dict.fromkeys(text_cols, 'first')

# merge the two dicts: https://stackoverflow.com/questions/38987
d = {**d1, **d2}

df.groupby('id', as_index=False).agg(d)
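
For reference, with the sample df above the merged dictionary should come out roughly like this (key order may vary because text_cols is a set, and it also includes the id column):

print(d)
{'value1': 'sum', 'value2': 'sum', 'id': 'first', 'text1': 'first'}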

Upvotes: 5

Allen Qin

Reputation: 19957

You need to first convert the column names to dicts mapping each column to the desired operation, and then merge the dicts.

(
    df.groupby('id', as_index=False)
    .agg({**{e:'sum' for e in data_cols}, **{e:'first' for e in text_cols}})
)

Upvotes: 4
