Reputation: 119
I have a DataFrame with start_time
in proper datetime format and start_station_name
as a string that looks like this:
start_time start_station_name
2019-03-20 11:04:16 San Francisco Caltrain (Townsend St at 4th St)
2019-04-06 14:19:06 Folsom St at 9th St
2019-05-24 17:21:11 Golden Gate Ave at Hyde St
2019-03-27 18:53:27 4th St at Mission Bay Blvd S
2019-04-16 08:45:16 Esprit Park
Now I would like to simply plot the frequency of each name over the year in months. To group the data accordingly, I used this:
data = df_clean.groupby(df_clean['start_time'].dt.strftime('%B'))['start_station_name'].value_counts()
Then I get something that is not a DataFrame but represented as a dtype: int64:
start_time start_station_name
April San Francisco Caltrain Station 2 (Townsend St at 4th St) 4866
Market St at 10th St 4609
San Francisco Ferry Building (Harry Bridges Plaza) 4270
Berry St at 4th St 3994
Montgomery St BART Station (Market St at 2nd St) 3550
...
September Mission Bay Kids Park 1026
11th St at Natoma St 1023
Victoria Manalo Draves Park 1018
Davis St at Jackson St 1015
San Francisco Caltrain Station (King St at 4th St) 1014
Now, I would like to simply plot it as a clustered bar chart using Seaborn's countplot()
, only for an absolute frequency above 1000, where the x-axis represents the month, the hue is the name and y-axis should show the counts:
sns.countplot(data = data[data > 1000], x = 'start_time', hue = 'start_station_name')
Then I get the error message Could not interpret input 'start_time'
, probably because it's not a proper DataFrame. How can I group/aggregate it in the first place, so that the visualization works?
Upvotes: 1
Views: 275
Reputation: 5502
Try:
data = df.groupby([df['start_time'].dt.strftime('%B'), 'start_station_name']) \
.count() \
.rename(columns={"start_time": "count"}) \
.reset_index()
ax = sns.countplot(x="start_time", hue="start_station_name", data=data[data.count > 1000])
Explanations:
start_station_name
columns. count
to get the number of cellscount
column to count
using rename
groupby
using reset_index
countplot
(using the second example from the doc).Full code
print(df)
# start_time start_station_name
# 0 2019-03-20 11:04:16 San Francisco Caltrain (Townsend St at 4th St)
# 1 2019-04-06 14:19:06 Folsom St at 9th St
# 2 2019-05-24 17:21:11 Golden Gate Ave at Hyde St
# 3 2019-03-27 18:53:27 4th St at Mission Bay Blvd S
# 4 2019-04-16 08:45:16 Esprit Park
data = df.groupby([df['start_time'].dt.strftime('%B'), 'start_station_name']) \
.count() \
.rename(columns={"start_time": "count"}) \
.reset_index()
print(data)
# start_time start_station_name count
# 0 April Esprit Park 1
# 1 April Folsom St at 9th St 1
# 2 March 4th St at Mission Bay Blvd S 1
# 3 March San Francisco Caltrain (Townsend St at 4th St) 1
# 4 May Golden Gate Ave at Hyde St 1
# Filter as you desired
# data = data[data.count > 1000]
# Plot
ax = sns.countplot(x="start_time", hue="start_station_name", data=data)
plt.show()
output
Upvotes: 2