Reputation: 57
I have a data table that contains NAME, DATE, TYPE, VALUE A, VALUE B, VALUE C, ....
Theoretically the NAME, DATE, TYPE are supposed to be unique, but it turns out they are not, so I need to figure out how to handle it.
Example Input:
Bob, 1/1/18, AType, blah, test, test2
Bob, 1/1/18, AType, blah2, test, test2
Bob, 1/1/18, BType, blah, test, test2
Ideal Output
Bob, 1/1/18, AType, [blah,blah2], test, test2
Bob, 1/1/18, BType, blah, test, test2
My goal is to merge the VALUE A values into a JSON style array, and then take the last value for VALUE B and VALUE C.
I have tried:
data.groupby('NAME', 'DATE', 'TYPE').apply(lambda x: ','.join(set(x)))
But that merges ALL of the fields.... Or seems to.
Any help?
Upvotes: 1
Views: 48
Reputation: 164783
This seems to work:
import pandas as pd
df = pd.DataFrame([['Bob', '1/1/18', 'AType', 'blah', 'test', 'test2'],
['Bob', '1/1/18', 'AType', 'blah2', 'test', 'test3'],
['Bob', '1/1/18', 'BType', 'blah', 'test', 'test2']],
columns=['NAME', 'DATE', 'TYPE', 'VALUE A', 'VALUE B', 'VALUE C'])
f = {'VALUE A': lambda x: set(x), 'VALUE B': 'last', 'VALUE C': 'last'}
df = df.groupby(['NAME', 'DATE', 'TYPE'])['VALUE A', 'VALUE B', 'VALUE C']\
.agg(f).reset_index()
df['VALUE A'] = df['VALUE A'].map(list)
# NAME DATE TYPE VALUE A VALUE B VALUE C
# 0 Bob 1/1/18 AType [blah2, blah] test test3
# 1 Bob 1/1/18 BType [blah] test test2
Upvotes: 1