Reputation: 9243
I have a Pandas
Dataframe
with columns Project Type
and Parts
. I would like to know how many part As are used in projects of Project Type 1. I am trying to use .count()
, but it doesn't return just a single number.
import pandas as pd
parts_df = pd.DataFrame(data = [['Type 1', ['A', 'B']], ['Type 2', ['B']]], columns=['Project Type', 'Parts'])
print (parts_df[(parts_df['Project Type'] == 'Type 1') & ('A' in parts_df['Parts'])]).count()
Output:
Project Type 0
Parts 0
dtype: int64
Desired Output:
1
Upvotes: 0
Views: 164
Reputation: 5372
You can spend a second to re-format the columns, and make life a little easier:
parts_df.Parts = parts_df.Parts.map(lambda x: ' '.join(x))
# Project type Parts
#0 Type 1 A B
#1 Type 2 B
Now you can use the Series.str.get_dummies
method:
dummies = parts_df.Parts.str.get_dummies( sep=' ')
# A B
#0 1 1
#1 0 1
which shows the presence or absence of each "Part" using either a 1 or 0 respectively. Use this dummies frame to create a dataframe that can easily be manipulated using all of the standard pandas methods (pandas doesn't like lists in columns):
new_parts_df = pandas.concat( (parts_df['Project Type'], dummies), axis=1)
# Project type A B
#0 Type 1 1 1
#1 Type 2 0 1
You can now easily count groups in several ways. The most efficient thing to do would be use pandas.DataFrame.query
, but the unfortunate white space in your column name "Project Type"
makes this difficult. I would avoid white spaces in column names whenever possible. Try this:
new_parts_df.rename( columns={'Project Type': 'Project_Type'}, inplace=True)
print(len(new_parts_df.query( 'Project_Type=="Type 1" and A==1')))
# 1
Upvotes: 0
Reputation: 829
Change the 'A' in df['Parts']
to a lambda.
import pandas as pd
parts_df = pd.DataFrame(data = [['Type 1', ['A', 'B']], ['Type 2', ['B']]], columns=['Project Type', 'Parts'])
res = (parts_df[(parts_df['Project Type'] == 'Type 1') & (parts_df['Parts'].apply(lambda x: 'A' in x))]).count()
res.max()
Result:
1
Upvotes: 0
Reputation: 394269
IIUC you want the following:
In [13]:
parts_df.loc[parts_df['Project Type'] == 'Type 1','Parts'].apply(lambda x: x.count('A'))
Out[13]:
0 1
Name: Parts, dtype: int64
If you want the scalar value rather than a series then you can call .values
attribute and index into the np array:
In [15]:
parts_df.loc[parts_df['Project Type'] == 'Type 1','Parts'].apply(lambda x: x.count('A')).values[0]
Out[15]:
1
You could just add a column that counts the 'A' parts:
In [17]:
parts_df['A count'] = parts_df['Parts'].apply(lambda x: x.count('A'))
parts_df
Out[17]:
Project Type Parts A count
0 Type 1 [A, B] 1
1 Type 2 [B] 0
you can then filter:
In [18]:
parts_df[(parts_df['Project Type'] == 'Type 1') & (parts_df['A count'] > 0)]
Out[18]:
Project Type Parts A count
0 Type 1 [A, B] 1
Upvotes: 0
Reputation: 826
you can try something like this :
sum(['A' in i for i in parts_df[parts_df['Project Type']=='Type 1']['Parts'].tolist()])
sample :
In[32]: parts_df = pd.DataFrame(data = [['Type 1', ['A', 'B']], ['Type 2', ['A']], ['Type 1', ['C']]], columns=['Project Type', 'Parts'])
In[33]: sum(['A' in i for i in parts_df[parts_df['Project Type']=='Type 1']['Parts'].tolist()])
Out[33]: 1
Upvotes: 1