Reputation: 792
Assume I have a dataframe of the following structure:
Name Desc Group ConditionType ConditionName
0 job1 desc1 group1 in cond1
1 job1 desc1 group1 in cond2
2 job1 desc1 group1 out cond1
3 job2 desc2 group1 in cond1
4 job2 desc2 group1 in cond2
5 job2 desc2 group1 in cond3
For the context, it's a full join of two tables; one with Name
, Desc
and Group
columns and the other with ConditionType
and ConditionName
columns. The tables are joined on the column Name
.
I would like to organize this DataFrame, such that I have the following structure:
Name Desc Group Conditions
0 job1 desc1 group1 {'in': ['cond1', 'cond2'], 'out': ['cond1']}
1 job2 desc2 group1 {'in': ['cond1', 'cond2', 'cond3'], 'out': []}
I am looking for a faster way to do this (possibly using groupby()
and apply()
) just not sure where to start. For-loop is really slow since the original table is over 100000 records.
Here's the code to instantiate both the initial table and the result that I seek:
table = [['job1', 'desc1', 'group1', 'in', 'cond1'],
['job1', 'desc1', 'group1', 'in', 'cond2'],
['job1', 'desc1', 'group1', 'out', 'cond1'],
['job2', 'desc2', 'group1', 'in', 'cond1'],
['job2', 'desc2', 'group1', 'in', 'cond2'],
['job2', 'desc2', 'group1', 'in', 'cond3']]
result = [['job1', 'desc1', 'group1', {'in': ['cond1', 'cond2'], 'out': ['cond1']}],
['job2', 'desc2', 'group1', {'in': ['cond1', 'cond2', 'cond3'], 'out': []}]]
table_df = pd.DataFrame(table, columns=['Name', 'Desc', 'Group', 'ConditionType', 'ConditionName'])
result_df = pd.DataFrame(result, columns=['Name', 'Desc', 'Group', 'Conditions'])
Upvotes: 1
Views: 62
Reputation: 164813
There will be a pandas
way, but this is the intuitive way via row iteration and collections.defaultdict
.
Note there are couple of minor adjustments you may wish to make (resetting index, adding empty dictionary items).
A groupby
method will not necessarily be faster. You should test with your data. For performance, use df.itertuples
instead of df.iterrows
.
from collections import defaultdict
d = defaultdict(lambda: defaultdict(list))
for idx, row in df.iterrows():
d[(row.Name, row.Desc, row.Group)][row.ConditionType].append(row.ConditionName)
df['Conditions'] = df.set_index(['Name', 'Desc', 'Group']).index.map(d.get)
df = df.loc[:, ['Name', 'Desc', 'Group', 'Conditions']]\
.drop_duplicates(subset=['Name', 'Desc', 'Group'])
print(df)
# Name Desc Group Conditions
# 0 job1 desc1 group1 {'in': ['cond1', 'cond2'], 'out': ['cond1']}
# 3 job2 desc2 group1 {'in': ['cond1', 'cond2', 'cond3']}
Upvotes: 2