CMcM

Reputation: 21

How to deal with list-like data in a pandas DataFrame column

Consider the following example:

I have a table of emails. Each row has an email id and two label columns, generated through different code paths, and each label column holds a list of labels associated with that email.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    'id': [1,2,3,4],
    'labels1': [np.array(['red']), np.array(['blue', 'green']), np.array(['blue']), np.nan],
    'labels2': [np.nan, np.nan, np.array(['yellow', 'purple']), np.array(['magenta'])]
})

df
   id        labels1           labels2
0   1          [red]               NaN
1   2  [blue, green]               NaN
2   3         [blue]  [yellow, purple]
3   4            NaN         [magenta]


So, I need a way to produce the following DataFrame:

df_merge
   id                  labels
0   1                   [red]
1   2           [blue, green]
2   3  [blue, yellow, purple]
3   4               [magenta]

But applying a lambda row-wise, as I would with scalar column data, raises a ValueError:

df.apply(lambda x: np.unique(np.append(x['labels1'], x['labels2'])), axis=1)

ValueError: Shape of passed values is (4, 2), indices imply (4, 4)

I've tried many different variations on the above, all to no avail. I'm wondering if perhaps array-like column data like this is a pandas anti-pattern, and if so, what are better approaches?

Upvotes: 2

Views: 690

Answers (1)

piRSquared

Reputation: 294488

  • replace each NaN with [] using applymap
  • sum across rows (adding lists concatenates them)

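labels = df.filter(like='labels')  # the labels1 and labels2 columns
df[['id']].assign(
    # the question stores NumPy arrays, so convert them to lists; NaN becomes []
    labels=labels.applymap(lambda x: list(x) if isinstance(x, (list, np.ndarray)) else []).sum(1)
)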

   id                  labels
0   1                   [red]
1   2           [blue, green]
2   3  [blue, yellow, purple]
3   4               [magenta]
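
If you would rather not rely on summing object columns, the same merge can be sketched with a plain Python helper. This is only an illustration under a few assumptions: `merge_labels` is a hypothetical name, not a pandas function; missing cells are scalar NaN as in the question; and duplicate labels are dropped while keeping first-seen order, since the question's `np.unique` call suggests duplicates are unwanted.

```python
import numpy as np
import pandas as pd

# Same data as in the question.
df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'labels1': [np.array(['red']), np.array(['blue', 'green']), np.array(['blue']), np.nan],
    'labels2': [np.nan, np.nan, np.array(['yellow', 'purple']), np.array(['magenta'])],
})


def merge_labels(*cells):
    """Concatenate list-like cells, skipping NaN and dropping duplicate labels.

    Hypothetical helper for illustration; not part of pandas.
    """
    merged = []
    for cell in cells:
        # A missing cell is a scalar NaN; real cells are lists, tuples or arrays.
        if isinstance(cell, (list, tuple, np.ndarray)):
            merged.extend(str(label) for label in cell)
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return list(dict.fromkeys(merged))


# Build the merged column row by row, then attach it with the original index.
merged_column = pd.Series(
    [merge_labels(a, b) for a, b in zip(df['labels1'], df['labels2'])],
    index=df.index,
)
df_merge = df[['id']].assign(labels=merged_column)
print(df_merge)
```

Wrapping the result in a `pd.Series` keeps each list as a single cell, so pandas does not try to unpack the lists into several columns.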

Upvotes: 3
