Max Mikhaylov

Reputation: 792

Grouping column values into a dictionary

Assume I have a dataframe of the following structure:

   Name   Desc   Group ConditionType ConditionName
0  job1  desc1  group1            in         cond1
1  job1  desc1  group1            in         cond2
2  job1  desc1  group1           out         cond1
3  job2  desc2  group1            in         cond1
4  job2  desc2  group1            in         cond2
5  job2  desc2  group1            in         cond3

For context, it is a full join of two tables: one with the Name, Desc and Group columns, the other with the ConditionType and ConditionName columns. The tables are joined on the Name column.
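
For reference, the join looks roughly like this (the two source frames here are made up only to illustrate; the real tables come from elsewhere):

import pandas as pd

# hypothetical source tables, only to illustrate the join
jobs_df = pd.DataFrame(
    [['job1', 'desc1', 'group1'],
     ['job2', 'desc2', 'group1']],
    columns=['Name', 'Desc', 'Group'])

conditions_df = pd.DataFrame(
    [['job1', 'in', 'cond1'], ['job1', 'in', 'cond2'], ['job1', 'out', 'cond1'],
     ['job2', 'in', 'cond1'], ['job2', 'in', 'cond2'], ['job2', 'in', 'cond3']],
    columns=['Name', 'ConditionType', 'ConditionName'])

# full (outer) join on Name gives the DataFrame shown above
df = jobs_df.merge(conditions_df, on='Name', how='outer')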

I would like to organize this DataFrame, such that I have the following structure:

   Name   Desc   Group                                      Conditions
0  job1  desc1  group1    {'in': ['cond1', 'cond2'], 'out': ['cond1']}
1  job2  desc2  group1  {'in': ['cond1', 'cond2', 'cond3'], 'out': []}

I am looking for a faster way to do this (possibly using groupby() and apply()), but I am not sure where to start. A plain for-loop is really slow, since the original table has over 100,000 records.

Here's the code to instantiate both the initial table and the result that I seek:

import pandas as pd

table = [['job1', 'desc1', 'group1', 'in', 'cond1'],
         ['job1', 'desc1', 'group1', 'in', 'cond2'],
         ['job1', 'desc1', 'group1', 'out', 'cond1'],
         ['job2', 'desc2', 'group1', 'in', 'cond1'],
         ['job2', 'desc2', 'group1', 'in', 'cond2'],
         ['job2', 'desc2', 'group1', 'in', 'cond3']]

result = [['job1', 'desc1', 'group1', {'in': ['cond1', 'cond2'], 'out': ['cond1']}],
          ['job2', 'desc2', 'group1', {'in': ['cond1', 'cond2', 'cond3'], 'out': []}]]

table_df = pd.DataFrame(table, columns=['Name', 'Desc', 'Group', 'ConditionType', 'ConditionName'])
result_df = pd.DataFrame(result, columns=['Name', 'Desc', 'Group', 'Conditions'])
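
The kind of loop I mean is roughly the following (a simplified sketch of that approach, not my exact code); it filters the table once per job, which does not scale well:

rows = []
# naive approach: filter the whole table again for every job
for name, desc, group in table_df[['Name', 'Desc', 'Group']].drop_duplicates().itertuples(index=False):
    sub = table_df[table_df['Name'] == name]
    conds = {'in': [], 'out': []}
    for cond_type, cond_name in zip(sub['ConditionType'], sub['ConditionName']):
        conds[cond_type].append(cond_name)
    rows.append([name, desc, group, conds])

slow_result_df = pd.DataFrame(rows, columns=['Name', 'Desc', 'Group', 'Conditions'])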

Upvotes: 1

Views: 62

Answers (1)

jpp

Reputation: 164813

There will be a pandas way, but this is the intuitive approach: row iteration with collections.defaultdict.

Note there are a couple of minor adjustments you may wish to make (resetting the index, adding empty dictionary items for missing condition types).

A groupby method will not necessarily be faster. You should test with your data. For performance, use df.itertuples instead of df.iterrows.

from collections import defaultdict

# map each (Name, Desc, Group) key to a {ConditionType: [ConditionName, ...]} dict
d = defaultdict(lambda: defaultdict(list))

for _, row in df.iterrows():
    d[(row.Name, row.Desc, row.Group)][row.ConditionType].append(row.ConditionName)

# look up the aggregated dictionary for each row's key
df['Conditions'] = df.set_index(['Name', 'Desc', 'Group']).index.map(d.get)

# keep one row per key with the relevant columns
df = df.loc[:, ['Name', 'Desc', 'Group', 'Conditions']]\
       .drop_duplicates(subset=['Name', 'Desc', 'Group'])

print(df)

#    Name   Desc   Group                                    Conditions
# 0  job1  desc1  group1  {'in': ['cond1', 'cond2'], 'out': ['cond1']}
# 3  job2  desc2  group1           {'in': ['cond1', 'cond2', 'cond3']}
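
For reference, an itertuples version that also applies the two adjustments mentioned above might look like the following sketch (not benchmarked; it starts again from the original frame from the question):

from collections import defaultdict

df = table_df.copy()  # start again from the original frame in the question

d = defaultdict(lambda: defaultdict(list))

# itertuples avoids building a Series for every row, unlike iterrows
for row in df.itertuples(index=False):
    d[(row.Name, row.Desc, row.Group)][row.ConditionType].append(row.ConditionName)

# make sure both condition types exist, and drop the defaultdict wrapper
d = {key: {ct: conds[ct] for ct in ('in', 'out')} for key, conds in d.items()}

df['Conditions'] = df.set_index(['Name', 'Desc', 'Group']).index.map(d.get)

df = (df.loc[:, ['Name', 'Desc', 'Group', 'Conditions']]
        .drop_duplicates(subset=['Name', 'Desc', 'Group'])
        .reset_index(drop=True))

Whether this beats a groupby-based version depends on your data, as noted above.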

Upvotes: 2
