Reputation: 21
Consider the following example:
I have a table of emails, each with an email id, plus two label columns (generated through different code paths) containing lists of labels associated with those emails.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'labels1': [np.array(['red']), np.array(['blue', 'green']), np.array(['blue']), np.nan],
    'labels2': [np.nan, np.nan, np.array(['yellow', 'purple']), np.array(['magenta'])]
})
df
id labels1 labels2
0 1 [red] NaN
1 2 [blue, green] NaN
2 3 [blue] [yellow, purple]
3 4 NaN [magenta]
So, I need a way to produce the following DataFrame:
df_merge
id labels
0 1 [red]
1 2 [blue, green]
2 3 [blue, yellow, purple]
3 4 [magenta]
But applying a lambda, as I would with scalar column data, raises a ValueError:
df.apply(lambda x: np.unique(np.append(x['labels1'], x['labels2'])), axis=1)
ValueError: Shape of passed values is (4, 2), indices imply (4, 4)
I've tried many different variations on the above, all to no avail. I'm wondering if perhaps array-like column data like this is a pandas anti-pattern, and if so, what are better approaches?
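For context, one workaround that seems to sidestep the ValueError (apply appears to try to broadcast the returned arrays into columns) is to build the merged column outside apply, e.g. with a plain list comprehension. This is only a sketch; merge_labels is a hypothetical helper, and pd.unique is used instead of np.unique to preserve the order of first appearance:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'labels1': [np.array(['red']), np.array(['blue', 'green']), np.array(['blue']), np.nan],
    'labels2': [np.nan, np.nan, np.array(['yellow', 'purple']), np.array(['magenta'])]
})

def merge_labels(a, b):
    # Keep only the cells that are actual arrays (NaN cells are dropped),
    # concatenate them, and deduplicate while preserving first-seen order.
    parts = [x for x in (a, b) if isinstance(x, np.ndarray)]
    return list(pd.unique(np.concatenate(parts))) if parts else []

df_merge = df[['id']].assign(
    labels=[merge_labels(a, b) for a, b in zip(df['labels1'], df['labels2'])]
)
```

Building the column as an ordinary Python list and handing it to assign avoids apply's shape inference entirely.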
Upvotes: 2
Views: 690
Reputation: 294488
Convert NaN into [] (and each array into a list, since the cells hold NumPy arrays) using applymap, then sum across rows:

labels = df[['labels1', 'labels2']]
df[['id']].assign(
    labels=labels.applymap(lambda x: list(x) if isinstance(x, np.ndarray) else []).sum(1)
)
id labels
0 1 [red]
1 2 [blue, green]
2 3 [blue, yellow, purple]
3 4 [magenta]
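The same NaN-to-empty idea can also be written with plain Series addition, which concatenates list cells element-wise and avoids relying on sum's axis handling. A minimal self-contained sketch, assuming (as in the question) that the cells are NumPy arrays or NaN:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'labels1': [np.array(['red']), np.array(['blue', 'green']), np.array(['blue']), np.nan],
    'labels2': [np.nan, np.nan, np.array(['yellow', 'purple']), np.array(['magenta'])]
})

# Normalize every cell: NaN -> [], ndarray -> list.
clean = df[['labels1', 'labels2']].applymap(
    lambda x: list(x) if isinstance(x, np.ndarray) else []
)

# Adding two object Series of lists concatenates them row by row.
df_merge = df[['id']].assign(labels=clean['labels1'] + clean['labels2'])
```

Note that this concatenates without deduplicating; for the example data the labels happen to be distinct, but if duplicates across the two columns are possible, a dedup step would still be needed.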
Upvotes: 3