Reputation: 20000
I am using pandas to process a csv file with the following structure
date, student, score, outof
20040601,mark,80,100
20040601,jack,40,100
20040602,mark,60,100
20040602,jack,30,100
I want to group the above data by date and then calculate the percentage for each date and then plot it.
I am able to do groupby, by using the following code
import pandas as pd
data = pd.read_csv("csv_file")
grouped_by_date = data.groupby('date')
I am also able to calculate the percentage for each line using the following code
import pandas as pd
data = pd.read_csv("csv_file")
date['percentage'] = data['score']/data['outof']
But I am not able to do it on the grouped data. How to do it on the grouped by data?
Edit:
The following is the desired output that I want
date, percentage
20040601,60
20040602,45
Upvotes: 1
Views: 1855
Reputation: 4577
Simply:
data['percentage'] = data.score/data.outof
data.groupby('date').mean()
Yields:
score outof percentage
date
20040601 60 100 0.60
20040602 45 100 0.45
Upvotes: 1
Reputation: 16508
The following is what I would do:
>>> df.set_index(['date'], inplace=True)
>>> df['percentDate'] = df.groupby(level=0).apply(lambda x: np.mean(x['score']/x['outof']))
>>> df
Out[391]:
student score outof percentDate
date
20040601 mark 80 100 0.60
20040601 jack 40 100 0.60
20040602 mark 60 100 0.45
20040602 jack 30 100 0.45
First of all, since you want to set the values into the dataframe as a column, it's nice to set the index according to what you group-by: it makes setting the values later on easier (to me).
Second, Observe that I have used numpy's mean()
function: since score/outof will return a column of observations (one row per student), you have to average out over that.
Upvotes: 1