Reputation: 155
I have two dataframes with identical columns. However the 'labels' column can have different labels. All labels are comma seperated strings. I want to make a union on the labels in order to go from this:
df1:
id1 id2 labels language
0 1 1 1 en
1 2 3 en
2 3 4 4 en
3 4 5 en
4 5 6 en
df2:
id1 id2 labels language
0 1 1 1,2 en
1 2 3 en
2 3 4 5,7 en
3 4 5 en
4 5 6 3 en
to this:
id1 id2 labels language
0 1 1 1,2 en
1 2 3 en
2 3 4 4,5,7 en
3 4 5 en
4 5 6 3 en
I've tried this:
df1['labels'] = df1['labels'].apply(lambda x: set(str(x).split(',')))
df2['labels'] = df2['labels'].apply(lambda x: set(str(x).split(',')))
result = df1.merge(df2, on=['article_id', 'line_number', 'language'], how='outer')
result['labels'] = result[['labels_x', 'labels_y']].apply(lambda x: list(set.union(*x)) if None not in x else set(), axis=1)
result['labels'] = result['labels'].apply(lambda x: ','.join(set(x)))
result = result.drop(['labels_x', 'techniques_y'], axis=1)
but I get a wierd df with odd commas in some places, e.g the ,3
.:
id1 id2 labels language
0 1 1 1,2 en
1 2 3 en
2 3 4 4,5,7 en
3 4 5 en
4 5 6 ,3 en
How can I properly fix the commas? Any help is appreciated!
Upvotes: 0
Views: 136
Reputation: 37827
Here is a possible solution with pandas.merge
:
out = (
df1.merge(df2, on=["id1", "id2", "language"])
.assign(labels= lambda x: x.filter(like="label")
.stack().str.split(",")
.explode().drop_duplicates()
.groupby(level=0).agg(",".join))
.drop(columns=["labels_x", "labels_y"])
[df1.columns]
)
Output :
print(out)
id1 id2 labels language
0 1 1 1,2 en
1 2 3 NaN en
2 3 4 4,5,7 en
3 4 5 NaN en
4 5 6 3 en
Upvotes: 1