MichaelK

Reputation: 15

Manipulating a pandas DataFrame column containing a list

I use the unique() function in pandas on a grouped column to create a column whose entries are lists of unique values. The sample data is built as follows:

import pandas as pd
from collections import OrderedDict
dct = OrderedDict([
('referencenum',['10','10','20','20','20','30','30','40']),
('Month',['Jan','Jan','Jan','Feb','Feb','Feb','Feb','Mar']),
('Category',['good','bad','bad','bad','bad','good','bad','bad'])
                 ])
df = pd.DataFrame.from_dict(dct)

This gives the following sample dataset:

  referencenum Month Category
0           10   Jan     good
1           10   Jan      bad
2           20   Jan      bad
3           20   Feb      bad
4           20   Feb      bad
5           30   Feb     good
6           30   Feb      bad
7           40   Mar      bad

Then I summarise as follows:

dfsummary = pd.DataFrame(df.groupby(['referencenum', 'Month'])['Category'].unique())
dfsummary = dfsummary.reset_index()

This gives the summary DataFrame, in which the "Category" column contains a list:

  referencenum Month     Category
0           10   Jan  [good, bad]
1           20   Feb        [bad]
2           20   Jan        [bad]
3           30   Feb  [good, bad]
4           40   Mar        [bad]

My question is: how do I add another column containing the number of items (the len()) of each list in the "Category" column?

Also, how do I extract the first or second item of each list into another column?

Can I do these manipulations within pandas or do I somehow need to drop out to list manipulations and then come back to pandas?

Many thanks!

Upvotes: 1

Views: 1774

Answers (2)

Rafaó

Reputation: 599

If you want the number of elements in each entry of the Category column, pass the built-in len() function to apply():

dfsummary['Category_len'] = dfsummary['Category'].apply(len)
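
As a rough end-to-end sketch, this rebuilds the question's sample data (with a plain dict instead of OrderedDict, which is my simplification) and chains reset_index() so that its result is actually kept:

```python
import pandas as pd

# Sample data from the question (plain dict used here for brevity).
df = pd.DataFrame({
    'referencenum': ['10', '10', '20', '20', '20', '30', '30', '40'],
    'Month': ['Jan', 'Jan', 'Jan', 'Feb', 'Feb', 'Feb', 'Feb', 'Mar'],
    'Category': ['good', 'bad', 'bad', 'bad', 'bad', 'good', 'bad', 'bad'],
})

# One row per (referencenum, Month), with the unique categories collected.
dfsummary = (
    df.groupby(['referencenum', 'Month'])['Category']
      .unique()
      .reset_index()
)

# apply(len) calls len() on each collected entry.
dfsummary['Category_len'] = dfsummary['Category'].apply(len)
print(dfsummary)
```

If only the count is needed and the lists themselves are not, df.groupby(['referencenum', 'Month'])['Category'].nunique() should give it directly.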

Upvotes: 0

gmds

Reputation: 19885

You should check out the Series accessors.

Basically, they're ways to handle the values contained in a Series that are specific to their type (datetime, string, etc.).

In this case, you would use dfsummary['Category'].str.len().

If you wanted the first element, you would use dfsummary['Category'].str[0].

To generalise: although it is named for strings, the .str accessor's len() and indexing operations also work element-wise on list-like values, so you can treat each element of a Series as a collection.
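
A minimal sketch of the accessor approach. For simplicity the summary frame is rebuilt by hand with plain Python lists taken from the table in the question, and the new column names (Category_len, Category_first, Category_second) are just illustrative:

```python
import pandas as pd

# Summary frame from the question, rebuilt by hand with plain lists.
dfsummary = pd.DataFrame({
    'referencenum': ['10', '20', '20', '30', '40'],
    'Month': ['Jan', 'Feb', 'Jan', 'Feb', 'Mar'],
    'Category': [['good', 'bad'], ['bad'], ['bad'], ['good', 'bad'], ['bad']],
})

# Number of items in each list.
dfsummary['Category_len'] = dfsummary['Category'].str.len()

# First item of each list.
dfsummary['Category_first'] = dfsummary['Category'].str[0]

# Second item; rows whose list has only one item get NaN instead of an error.
dfsummary['Category_second'] = dfsummary['Category'].str[1]

print(dfsummary)
```

Note that .str[1] does not raise for the one-item lists, so the missing values may need handling afterwards (for example with fillna()).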

Upvotes: 1
