Reputation: 7261
I am trying to create a generic function to aggregate a Pandas dataframe to a single row per part number.
The dataframe is being created from reading in a csv which is huge, it is 115,371 x 174
. For a minimal, complete, verifiable example, I have reduced the data to the table below:
| Part | Plant | Make_Buy |
|:----:|:-----:|:--------:|
| A | | Buy |
| A | XXX | Make |
| A | XXX | Buy |
| A | XXY | Buy |
| A | XXY | Make |
| B | YY | Make |
| B | XXY | Make |
Here also as Python code:
import pandas as pd
d = {'part': ['A', 'A', 'A', 'A', 'A', 'B', 'B'],
'plant': [None, 'XXX', 'XXX', 'XXY', 'XXY', 'YY', 'XXY'],
'make_buy': ['Buy', 'Make', 'Buy', 'Buy', 'Make', 'Make', 'Make']}
df = pd.DataFrame(data=d)
The ideal function would be able to do the following:
For any column that contains multiple values for a single part, aggregate them.
The ideal dataframe would look like:
| Part | Plant | Make_Buy |
|:----:|:--------:|:---------:|
| A | XXX, XXY | Buy, Make |
| B | YY, XXY | Make |
The issue is, there are 175 columns. So I need something that is generic enough to scan through each column, if there are multiple lines, apply the aggregation. I know in Oracle SQL
I can do:
SELECT
Part,
LISTAGG(col1, ', ') AS col1,
LISTAGG(col2, ', ') AS col2,
etc
FROM
table
I have looked into aggregation via text columns via the following post, but I don't know how to generically change:
df['part'] = df[['part','plant','make_buy']].groupby(['part'])['part'].transform(lambda x: ','.join(x))
df[['part','plant','make_buy']].drop_duplicates()
To every column
Upvotes: 1
Views: 174
Reputation: 88236
You can use sets
for this:
df.groupby('Part').agg(set)
Plant Make_Buy
Part
A {XXY, XXX} {Buy, Make}
B {YY, XXY} {Make}
To have the result as strings:
df.groupby('Part').agg(lambda x: ', '.join(set(x)))
Plant Make_Buy
Part
A XXY, XXX Buy, Make
B YY, XXY Make
Upvotes: 3