Reputation: 1
I'm very new to python and pandas and was looking for some help. I'm working off of a CSV, and trying to use pandas to calculate totals for each name based on the value of the middle column. I want the sum of 'count' for each name for 'GEN' and 'NPR' added together. This is my dataset as a CSV:
StartingCSV.csv:
Name, Specialty, Count
Smith, GEN, 1
Smith, INT, 2
Smith, NPR, 5
Smith, PSC, 4
Zane, GEN, 3
Zane, PSC, 4
Zane, NPR, 4
Charles, NPR, 4
Charles, AUD, 4
Desired output:
Smith: 6
Zane: 7
Charles: 4
This is what I have so far:
import csv
import pandas as pd
df = pd.read_csv("StartingCSV.csv")
newdf = df.groupby("Name")
newdf.apply(lambda x: x[x['Specialty'] == 'NPR']['Count'].sum())
This is the output I get:
Smith: 5
Zane: 4
Charles: 4
This returns the number of NPR for each name, but I can't figure out a way to ADD the NPR values to the GEN values for each name, to create the desired output as listed above. Trying to add an "or" after 'NPR' in the final line returns an error like this:
newdf.apply(lambda x: x[x['Specialty'] == 'NPR' or 'GEN']['Count'].sum())
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
Any help is appreciated! (sorry if I'm way off base or if this is unclear)
Upvotes: 0
Views: 428
Reputation: 8033
You can do it this way
df[df['Specialty'].isin(['NPR','GEN'])].groupby('Name').sum().reset_index()
with df[df['Specialty'].isin(['NPR','GEN'])]
we are getting only those rows of the dataframe which have the value 'NPR' or 'GEN' in column 'Specialty'. After that it is the usual groupby
and sum
Output
Name Count
Charles 4
Smith 6
Zane 7
Upvotes: 2