user2242044
user2242044

Reputation: 9243

Filtering Pandas dataframe on two criteria where one column is a list

I have a Pandas Dataframe with columns Project Type and Parts. I would like to know how many part As are used in projects of Project Type 1. I am trying to use .count(), but it doesn't return just a single number.

import pandas as pd
parts_df = pd.DataFrame(data = [['Type 1', ['A', 'B']], ['Type 2', ['B']]], columns=['Project Type', 'Parts'])
print (parts_df[(parts_df['Project Type'] == 'Type 1') & ('A' in parts_df['Parts'])]).count()

Output:

Project Type    0
Parts           0
dtype: int64

Desired Output:

1

Upvotes: 0

Views: 164

Answers (4)

dermen
dermen

Reputation: 5372

You can spend a second to re-format the columns, and make life a little easier:

parts_df.Parts = parts_df.Parts.map(lambda x: ' '.join(x))
#  Project type Parts
#0       Type 1   A B
#1       Type 2     B

Now you can use the Series.str.get_dummies method:

dummies = parts_df.Parts.str.get_dummies( sep=' ')
#   A  B
#0  1  1
#1  0  1

which shows the presence or absence of each "Part" using either a 1 or 0 respectively. Use this dummies frame to create a dataframe that can easily be manipulated using all of the standard pandas methods (pandas doesn't like lists in columns):

new_parts_df = pandas.concat( (parts_df['Project Type'], dummies), axis=1)
#  Project type  A  B
#0       Type 1  1  1
#1       Type 2  0  1

You can now easily count groups in several ways. The most efficient thing to do would be use pandas.DataFrame.query, but the unfortunate white space in your column name "Project Type" makes this difficult. I would avoid white spaces in column names whenever possible. Try this:

new_parts_df.rename( columns={'Project Type': 'Project_Type'}, inplace=True)
print(len(new_parts_df.query( 'Project_Type=="Type 1" and A==1')))
# 1

Upvotes: 0

Brian Pendleton
Brian Pendleton

Reputation: 829

Change the 'A' in df['Parts'] to a lambda.

import pandas as pd
parts_df = pd.DataFrame(data = [['Type 1', ['A', 'B']], ['Type 2', ['B']]], columns=['Project Type', 'Parts'])
res = (parts_df[(parts_df['Project Type'] == 'Type 1') & (parts_df['Parts'].apply(lambda x: 'A' in x))]).count()
res.max()

Result:

1

Upvotes: 0

EdChum
EdChum

Reputation: 394269

IIUC you want the following:

In [13]:
parts_df.loc[parts_df['Project Type'] == 'Type 1','Parts'].apply(lambda x: x.count('A'))

Out[13]:
0    1
Name: Parts, dtype: int64

If you want the scalar value rather than a series then you can call .values attribute and index into the np array:

In [15]:
parts_df.loc[parts_df['Project Type'] == 'Type 1','Parts'].apply(lambda x: x.count('A')).values[0]

Out[15]:
1

You could just add a column that counts the 'A' parts:

In [17]:
parts_df['A count'] = parts_df['Parts'].apply(lambda x: x.count('A'))
parts_df
Out[17]:
  Project Type   Parts  A count
0       Type 1  [A, B]        1
1       Type 2     [B]        0

you can then filter:

In [18]:    
parts_df[(parts_df['Project Type'] == 'Type 1') & (parts_df['A count'] > 0)]

Out[18]:
  Project Type   Parts  A count
0       Type 1  [A, B]        1

Upvotes: 0

Alex
Alex

Reputation: 826

you can try something like this :

sum(['A' in i for i in parts_df[parts_df['Project Type']=='Type 1']['Parts'].tolist()])

sample :

In[32]: parts_df = pd.DataFrame(data = [['Type 1', ['A', 'B']], ['Type 2', ['A']], ['Type 1', ['C']]], columns=['Project Type', 'Parts'])
In[33]: sum(['A' in i for i in parts_df[parts_df['Project Type']=='Type 1']['Parts'].tolist()])
Out[33]: 1

Upvotes: 1

Related Questions