Reputation: 1983
I am a complete beginner in Python. Long story short, I want to group by one column, apply one function to one column and another function to another column, and plot the results (the first column on the x-axis, the second on the y-axis).
I have a pandas data frame df which contains many columns. Two of them are tour_id and tour_distance.
tour_id tour_distance
A 10
A 10
A 10
A 10
B 20
B 20
C 40
C 40
C 40
C 40
C 40
: :
: :
Since I assume that the longer tour_distance is, the more rows each tour_id has, I want to plot tour_distance against the row count of each tour_id group.
Question 1: what's the simplest solution for this groupby and plot problem?
Question 2: how can I improve my failed attempt?
My attempt: I thought it would be easier to make a new data frame like this.
tour_id tour_distance row_counts
A 10 4
B 20 2
C 40 5
: : :
This way I can use matplotlib like this:
import matplotlib.pyplot as plt
x = df.tour_distance
y = df.row_counts
plt.bar(x,y)
However, I can't make this data frame.
df_tour_distance = df.groupby('tour_id').tour_distance.head(1)
df_tour_distance = pd.DataFrame(df_tour_distance)
df_size = df.groupby('tour_id').tour_distance.size()
df_size = pd.DataFrame(df_size)
df = pd.merge(df_size, df_tour_distance, on='tour_id')
>>> KeyError: 'tour_id'
This also failed:
g = df.groupby('tour_id')
result = g.agg({'Count':lambda x:x.size(),
'tour_distance_grouped':lambda x:x.head(1)})
result
>>> KeyError: 'Count'
Upvotes: 1
Views: 3605
Reputation: 862
This could be implemented somewhat more simply:
import pandas as pd
tour_id = ['A']*4+['B']*2+['C']*5
tour_distance = [10]*4+[20]*2+[40]*5
df = pd.DataFrame({'tour_id': tour_id, 'tour_distance': tour_distance})
df = df.set_index('tour_id')
df2 = pd.DataFrame()
df2['tour_distance'] = df.groupby('tour_id')['tour_distance'].head(1)
df2['row_counts'] = df.groupby('tour_id')['tour_distance'].count()  # select the column so a Series is assigned
print(df2)
Result:
tour_distance row_counts
tour_id
A 10 4
B 20 2
C 40 5
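The same table can probably also be built in a single groupby call using named aggregation (available in pandas 0.25 and later). This is only a minimal sketch: it assumes the sample data from the question and that tour_distance is constant within each tour_id, so taking the first value per group is safe. The name summary is illustrative.

```python
import pandas as pd

# Sample data mirroring the question (an assumption, not the real data set)
df = pd.DataFrame({
    'tour_id': ['A'] * 4 + ['B'] * 2 + ['C'] * 5,
    'tour_distance': [10] * 4 + [20] * 2 + [40] * 5,
})

# One groupby call: first distance per tour and number of rows per tour
summary = df.groupby('tour_id').agg(
    tour_distance=('tour_distance', 'first'),
    row_counts=('tour_distance', 'size'),
)
print(summary)
```

Here tour_id stays in the index; calling summary.reset_index() would turn it back into a regular column if it is needed for a merge or a plot.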
Upvotes: 0
Reputation: 8631
The problem in your code is that once you group by tour_id, it becomes the index. You have to specify as_index=False or use reset_index() in order to use it as a column again. Also, you do not need to compute separate series and then merge them back.
You need:
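import matplotlib.pyplot as plt
# Count rows per (tour_id, tour_distance) pair; reset_index turns the keys back into columns
g = df.groupby(['tour_id', 'tour_distance']).size().reset_index(name='count')
plt.bar(g['tour_id'], g['count'])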
Output: a bar chart with one bar per tour_id showing its row count (image not preserved).
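To make the index point concrete, and to put tour_distance on the x-axis as the question asks, here is a hedged sketch. It assumes the question's sample data. The names by_index, first, sizes, merged and the output file tour_counts.png are illustrative, not part of the original code. The merge is kept only to show how the original attempt could be repaired; it is not required.

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Sample data mirroring the question (an assumption, not the real data set)
df = pd.DataFrame({
    'tour_id': ['A'] * 4 + ['B'] * 2 + ['C'] * 5,
    'tour_distance': [10] * 4 + [20] * 2 + [40] * 5,
})

# Default groupby: the key moves into the index, so it is not a column
by_index = df.groupby('tour_id')['tour_distance'].size().to_frame()
print('tour_id' in by_index.columns)  # prints False

# Fix 1: as_index=False keeps tour_id as a regular column
first = df.groupby('tour_id', as_index=False)['tour_distance'].first()
# Fix 2: reset_index() turns the index back into a column
sizes = df.groupby('tour_id').size().reset_index(name='row_counts')

# Now both frames have a tour_id column and the merge succeeds
merged = pd.merge(first, sizes, on='tour_id')
print(merged)

# Bar chart of row counts against tour distance
fig, ax = plt.subplots()
ax.bar(merged['tour_distance'], merged['row_counts'], width=5)
ax.set_xlabel('tour_distance')
ax.set_ylabel('row_counts')
fig.savefig('tour_counts.png')  # hypothetical output file name
```

In an interactive session, the matplotlib.use('Agg') line can be dropped and plt.show() called instead of saving to a file.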
Upvotes: 2