Reputation: 117
I have a DataFrame in pandas that looks like
memberid | ticketid | agegrp | c1 | c2 | c3 | d1 | date |
---|---|---|---|---|---|---|---|
1 | 1 | 1 | A | A | AA | 2019-01-01 | |
1 | 1 | 1 | A | A | AB | 2019-01-02 | |
1 | 1 | 1 | A | A | C | AC | 2019-01-01 |
1 | 2 | 1 | A | D | C | AA | 2019-02-01 |
1 | 2 | 1 | A | D | C | AC | 2019-02-01 |
2 | 3 | 4 | C | A | C | CA | 2019-03-01 |
2 | 3 | 4 | C | A | C | CD | 2019-03-01 |
2 | 3 | 4 | C | A | C | BB | 2019-03-01 |
2 | 3 | 4 | C | A | C | AA | 2019-03-02 |
df = pd.DataFrame( {
'memberiD': [1,1,1,1,1,2,2,2,2],
'ticketid': [1,1,1,2,2,3,3,3,3],
'agegrp': [1,1,1,1,4,4,4,4],
'c1': ['a','a','a','a','a','c','c','c','c'],
'c2': ['a','a','a','d','d','a','a','a','a'],
'c3': ['','','c','c','c','c','c','c','c'],
'd1': ['aa','ab','ac','aa','ac','ca','cd','bb','aa']
} );
I want to group by ticketid
so that one ticket id gets represented on exactly one row.
For each ticketid, memberid and agegrp should be exactly the same.
For the c1,c2,c3 in a ticketid, just pull the most-frequent distinct 3 that appear - in case of a tie, any of the top 3 is fine.
For all d1 in any one ticketid, take the most-frequent distinct 3 that appear, and insert them into columns d1,d2,d3 - similar to the c1,c2,c3 columns: if there is a tie, any of the top 3 is fine.
For date, just select the earliest date that appears for any ticketid.
So, a resulting dataframe could be:
memberid | ticketid | agegrp | c1 | c2 | c3 | d1 | d2 | d3 | date |
---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | A | C | AA | AB | AC | 2019-01-01 | |
1 | 2 | 1 | A | D | C | AA | AC | 2019-02-01 | |
1 | 3 | 4 | C | A | CA | CD | BB | 2019-03-01 |
I tried looking at indexing on ticketid, but I'm not exactly sure how to make the new columns with that indexing...although I'm not sure if this approach is correct in general.
Upvotes: 0
Views: 34
Reputation: 14949
Do you want this?
from statistics import mode
from collections import Counter
final_df =df.groupby('ticketid', as_index=False).agg({'memberid': mode,'c1':mode, 'c2': mode, 'c3': mode,'date': min,'d1': list})
final_df['d1'] = final_df.d1.apply(lambda x: ','.join(list(Counter(x))[:3]) if len(x) >= 3 else ','.join(x))
final_df[['d1','d2','d3']] = final_df['d1'].str.split(',', expand=True)
Output -
ticketid memberid c1 c2 c3 date d1 d2 d3
0 1 1 A A NaN 2019-01-01 AA AB AC
1 2 1 A D C 2019-02-01 AA AC None
2 3 2 C A C 2019-03-01 CA CD BB
Upvotes: 1