Pandas GroupBy Concatenation Issues

Question

I have a data table that contains NAME, DATE, TYPE, VALUE A, VALUE B, VALUE C, ....

Theoretically the NAME, DATE, TYPE are supposed to be unique, but it turns out they are not, so I need to figure out how to handle it.

Example Input:

Bob, 1/1/18, AType, blah, test, test2
Bob, 1/1/18, AType, blah2, test, test2
Bob, 1/1/18, BType, blah, test, test2

Ideal Output

Bob, 1/1/18, AType, [blah,blah2], test, test2
Bob, 1/1/18, BType, blah, test, test2

My goal is to merge the VALUE A values into a JSON style array, and then take the last value for VALUE B and VALUE C.

I have tried:

data.groupby('NAME', 'DATE', 'TYPE').apply(lambda x: ','.join(set(x)))

But that merges ALL of the fields.... Or seems to.

Any help?

jpp · Accepted Answer

This seems to work:

import pandas as pd

df = pd.DataFrame([['Bob', '1/1/18', 'AType', 'blah', 'test', 'test2'],
                   ['Bob', '1/1/18', 'AType', 'blah2', 'test', 'test3'],
                   ['Bob', '1/1/18', 'BType', 'blah', 'test', 'test2']],
                  columns=['NAME', 'DATE', 'TYPE', 'VALUE A', 'VALUE B', 'VALUE C'])

f = {'VALUE A': lambda x: set(x), 'VALUE B': 'last', 'VALUE C': 'last'}

df = df.groupby(['NAME', 'DATE', 'TYPE'])['VALUE A', 'VALUE B', 'VALUE C']\
       .agg(f).reset_index()
df['VALUE A'] = df['VALUE A'].map(list)

#   NAME    DATE   TYPE        VALUE A VALUE B VALUE C
# 0  Bob  1/1/18  AType  [blah2, blah]    test   test3
# 1  Bob  1/1/18  BType         [blah]    test   test2

Pandas GroupBy Concatenation Issues

Answers (1)

Related Questions