Reputation: 59238
I'm reading a CSV file with pandas, and after reading it I'd like to calculate two things for each distinct value: how many times it occurs and what percentage of all rows that is.
For example, if my data is [X,X,Y,Z,Z,X,X,Y,Z,Y], I want my output to be
X 4 40.0
Y 3 30.0
Z 3 30.0
I tried the following, but it only outputs the counts:
train = pd.read_csv("./../input/train.csv")
grouped = train.groupby([x ,y]).size()
And this only calculates the percentages:
train = pd.read_csv("./../input/train.csv")
grouped = grouped.groupby(level=[0]).apply(lambda x: x / x.sum())
How can I get both?
Upvotes: 2
Views: 90
Reputation: 2477
I would calculate the two separately and concatenate them:
import pandas as pd

d = {'col_one': ['X','X','Y','Z','Z','X','X','Y','Z','Y']}
df = pd.DataFrame(data=d)
nb_rows = len(df)
# Number of rows per distinct value
serie_count = df.groupby('col_one').size().rename('count')
# Share of all rows, as a percentage
serie_percentage = (100.*serie_count/nb_rows).rename('percentage')
final_df = pd.concat([serie_count, serie_percentage], axis=1)
print(final_df)
Output:
         count  percentage
col_one
X            4        40.0
Y            3        30.0
Z            3        30.0
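The question groups by two keys and takes the percentage within each first-level group. The same count-then-concatenate idea could be adapted to that case roughly as follows. This is only a sketch: the column names `x` and `y` and the sample data are assumptions, since the question does not show the real columns.

```python
import pandas as pd

# Hypothetical data; the column names 'x' and 'y' are assumptions.
train = pd.DataFrame({
    'x': ['a', 'a', 'a', 'b', 'b'],
    'y': ['X', 'X', 'Y', 'Z', 'Z'],
})

# Count rows for each (x, y) pair.
counts = train.groupby(['x', 'y']).size().rename('count')

# Total count of each first-level group (x), aligned to the counts index.
group_totals = counts.groupby(level=0).transform('sum')

# Percentage of each pair within its first-level group.
percentage = (100.0 * counts / group_totals).rename('percentage')

result = pd.concat([counts, percentage], axis=1)
print(result)
```

Using `transform('sum')` keeps the totals aligned with the original MultiIndex, so the division is a plain element-wise operation and the percentages in each `x` group add up to 100.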
Upvotes: 1
Reputation: 863166
I think you need to divide the new count column by its sum (with div) to get the percentage column:
import pandas as pd

df = pd.DataFrame({'A':list('XXYZZXXYZY')})
df = df.groupby('A').size().reset_index(name='count')
df['%'] = df['count'].div(df['count'].sum()).mul(100)
print(df)
   A  count     %
0  X      4  40.0
1  Y      3  30.0
2  Z      3  30.0
Alternative solution with value_counts (starting again from the original, ungrouped DataFrame):
df = pd.DataFrame({'A':list('XXYZZXXYZY')})
df = pd.concat([df['A'].value_counts().rename('count'),
                df['A'].value_counts(normalize=True).rename('%').mul(100)], axis=1)
df = df.rename_axis('A').reset_index()
print(df)
   A  count     %
0  X      4  40.0
1  Y      3  30.0
2  Z      3  30.0
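If this is needed for several columns, the value_counts approach can be wrapped in a small helper. The function name `count_and_percentage` is hypothetical and not part of pandas. This is a minimal sketch that assumes missing values should be ignored, which is the default behaviour of `value_counts`.

```python
import pandas as pd

def count_and_percentage(series: pd.Series) -> pd.DataFrame:
    """Return one row per distinct value with its count and percentage."""
    # value_counts() drops NaN by default and sorts by frequency.
    counts = series.value_counts().rename('count')
    # Percentage relative to the number of counted (non-missing) rows.
    percentage = (100.0 * counts / counts.sum()).rename('%')
    # Sort by value so the row order does not depend on tie-breaking.
    return pd.concat([counts, percentage], axis=1).sort_index()

df = pd.DataFrame({'A': list('XXYZZXXYZY')})
summary = count_and_percentage(df['A'])
print(summary)
```

Sorting by index at the end is a design choice: values with equal counts (here Y and Z) then always come out in the same order.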
Upvotes: 3