nic
nic

Reputation: 119

Seaborn visualize grouped frequeny

I have a DataFrame with start_time in proper datetime format and start_station_name as a string that looks like this:

    start_time                      start_station_name
2019-03-20 11:04:16     San Francisco Caltrain (Townsend St at 4th St)
2019-04-06 14:19:06     Folsom St at 9th St
2019-05-24 17:21:11     Golden Gate Ave at Hyde St
2019-03-27 18:53:27     4th St at Mission Bay Blvd S
2019-04-16 08:45:16     Esprit Park

Now I would like to simply plot the frequency of each name over the year in months. To group the data accordingly, I used this:

data = df_clean.groupby(df_clean['start_time'].dt.strftime('%B'))['start_station_name'].value_counts()

Then I get something that is not a DataFrame but represented as a dtype: int64:

start_time  start_station_name                                       
April       San Francisco Caltrain Station 2  (Townsend St at 4th St)    4866
            Market St at 10th St                                         4609
            San Francisco Ferry Building (Harry Bridges Plaza)           4270
            Berry St at 4th St                                           3994
            Montgomery St BART Station (Market St at 2nd St)             3550
                                                                         ... 
September   Mission Bay Kids Park                                        1026
            11th St at Natoma St                                         1023
            Victoria Manalo Draves Park                                  1018
            Davis St at Jackson St                                       1015
            San Francisco Caltrain Station (King St at 4th St)           1014

Now, I would like to simply plot it as a clustered bar chart using Seaborn's countplot(), only for an absolute frequency above 1000, where the x-axis represents the month, the hue is the name and y-axis should show the counts:

sns.countplot(data = data[data > 1000], x = 'start_time', hue = 'start_station_name')

Then I get the error message Could not interpret input 'start_time', probably because it's not a proper DataFrame. How can I group/aggregate it in the first place, so that the visualization works?

Upvotes: 1

Views: 275

Answers (1)

Alexandre B.
Alexandre B.

Reputation: 5502

Try:

data = df.groupby([df['start_time'].dt.strftime('%B'), 'start_station_name']) \
        .count() \
        .rename(columns={"start_time": "count"}) \
        .reset_index()
ax = sns.countplot(x="start_time", hue="start_station_name", data=data[data.count > 1000])

Explanations:

  • I change the key in the groupby by adding the start_station_name columns.
  • Use count to get the number of cells
  • Rename the count column to count using rename
  • Reset the index from the groupby using reset_index
  • Subset dataset
  • Plot the result using countplot (using the second example from the doc).

Full code

print(df)
#            start_time                              start_station_name
# 0 2019-03-20 11:04:16  San Francisco Caltrain (Townsend St at 4th St)
# 1 2019-04-06 14:19:06                             Folsom St at 9th St
# 2 2019-05-24 17:21:11                      Golden Gate Ave at Hyde St
# 3 2019-03-27 18:53:27                    4th St at Mission Bay Blvd S
# 4 2019-04-16 08:45:16                                     Esprit Park

data = df.groupby([df['start_time'].dt.strftime('%B'), 'start_station_name']) \
        .count() \
        .rename(columns={"start_time": "count"}) \
        .reset_index()
print(data)
#   start_time                              start_station_name  count
# 0      April                                     Esprit Park      1
# 1      April                             Folsom St at 9th St      1
# 2      March                    4th St at Mission Bay Blvd S      1
# 3      March  San Francisco Caltrain (Townsend St at 4th St)      1
# 4        May                      Golden Gate Ave at Hyde St      1

# Filter as you desired
# data = data[data.count > 1000]

# Plot
ax = sns.countplot(x="start_time", hue="start_station_name", data=data)
plt.show()

output

enter image description here

Upvotes: 2

Related Questions