Reputation: 7356
I am new to Python as well as matplotlib. I am trying to plot trip data for each city as a histogram with matplotlib. Here is a sample of the data I am trying to plot.
Data:
    duration   month  hour  day_of_week  user_type
0   15.433333  3      23    Thursday     Subscriber
1   3.300000   3      22    Thursday     Subscriber
2   2.066667   3      22    Thursday     Subscriber
3   19.683333  3      22    Thursday     Subscriber
4   10.933333  3      22    Thursday     Subscriber
5   19.000000  3      21    Thursday     Subscriber
6   6.966667   3      21    Thursday     Subscriber
7   17.033333  3      20    Thursday     Subscriber
8   6.116667   3      20    Thursday     Subscriber
9   6.316667   3      20    Thursday     Subscriber
10  11.300000  3      20    Thursday     Subscriber
11  8.300000   3      20    Thursday     Subscriber
12  8.283333   3      19    Thursday     Subscriber
13  36.033333  3      19    Thursday     Subscriber
14  5.833333   3      19    Thursday     Subscriber
15  5.350000   3      19    Thursday     Subscriber
Code:
import csv
import matplotlib.pyplot as plt

def get_durations_as_list(filename):
    # Collect trip durations under 75 minutes, split by user type.
    subscribers = []
    customers = []
    with open(filename, 'r') as f_in:
        reader = csv.reader(f_in)
        next(reader, None)  # skip the header row
        for row in reader:
            if row[4] in ['Subscriber', 'Registered'] and float(row[0]) < 75:
                subscribers.append(float(row[0]))
            elif row[4] in ['Casual', 'Customer'] and float(row[0]) < 75:
                customers.append(float(row[0]))
    return subscribers, customers

data_files = ['./data/Washington-2016-Summary.csv', './data/Chicago-2016-Summary.csv', './data/NYC-2016-Summary.csv']
for file in data_files:
    # './data/Washington-2016-Summary.csv' -> 'Washington'
    city = file.split('-')[0].split('/')[-1]
    subscribers, customers = get_durations_as_list(file)
    plt.hist(subscribers, range=[min(subscribers), max(subscribers)], bins=5)
    plt.title('Distribution of Subscriber Trip Durations for city {}'.format(city))
    plt.xlabel('Duration (m)')
    plt.show()
    # Use the customers' own minimum and maximum for the second plot.
    plt.hist(customers, range=[min(customers), max(customers)], bins=5)
    plt.title('Distribution of Customers Trip Durations for city {}'.format(city))
    plt.xlabel('Duration (m)')
    plt.show()
Now the question is how to make each bin 5 minutes wide and how to plot only the trips that are shorter than 75 minutes.
I have gone through the documentation, but it looks complicated. After reading a few Stack Overflow questions, I found that bins are used to set the time interval. Is my assumption correct?
Upvotes: 0
Views: 2602
Reputation: 184
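To get 5-minute intervals with a maximum duration of 75 minutes, you need 75/5 = 15 intervals. Note that 15 is the number of bins, not the bin size.
You can write it either as bins=int(75/5) together with range=(0, 75),
or, as @om tripathi suggested, pass the bin edges explicitly with numpy.arange(0, 80, 5). The stop value has to lie past 75 because arange excludes it; numpy.arange(0, 75, 5) ends at 70 and would drop the 70-75 bin.
Also, you need not filter out durations greater than 75 minutes in the data filtering stage. You can set range=(0, 75)
in the histogram call to discard values outside that interval. The range argument has no effect when bins is a sequence, because the edges already define the limits.
e.g. pyplot.hist(data, bins=15, range=(0, 75)) or pyplot.hist(data, bins=numpy.arange(0, 80, 5))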
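The two ways of specifying 5-minute bins can be compared without opening a plot window by running them through numpy.histogram, which, as far as I know, is what pyplot.hist relies on for its binning. This is only a minimal sketch: the durations list is made-up sample data, not taken from the question's files.

```python
import numpy as np

# Made-up trip durations in minutes (assumption, for illustration only).
durations = [2.1, 3.3, 6.1, 10.9, 15.4, 19.0, 36.0, 72.5, 80.0, 120.0]

# Option 1: ask for 15 equal-width bins over the interval [0, 75].
counts_a, edges_a = np.histogram(durations, bins=15, range=(0, 75))

# Option 2: pass the 16 bin edges 0, 5, ..., 75 explicitly.
# The stop value is 80 because arange excludes it.
edges = np.arange(0, 80, 5)
counts_b, edges_b = np.histogram(durations, bins=edges)

print(edges_b)   # the bin edges
print(counts_a)  # trips per 5-minute bin; 80.0 and 120.0 are not counted
```

Both calls should give the same counts, and the two trips longer than 75 minutes fall outside the edges, so they are left out without any explicit filtering.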
Upvotes: 0
Reputation: 300
Yes, your assumption is correct: the bins parameter also accepts a sequence of bin edges. In your case it would look like this:
b = [0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75]
The last edge has to be 75; otherwise trips between 70 and 75 minutes are left out. You can use numpy to generate the above list:
bins = numpy.arange(0, 80, 5)
You can also plot the Subscriber and Customer data sets in one go. Below is the function:
import csv
import matplotlib.pyplot as plt

def plot_duration_type(filename):
    # './data/Washington-2016-Summary.csv' -> 'Washington'
    city = filename.split('-')[0].split('/')[-1]
    subscriber_duration = []
    customer_duration = []
    with open(filename, 'r') as f_in:
        reader = csv.DictReader(f_in)
        for row in reader:
            if float(row['duration']) < 75 and row['user_type'] == 'Subscriber':
                subscriber_duration.append(float(row['duration']))
            elif float(row['duration']) < 75 and row['user_type'] == 'Customer':
                customer_duration.append(float(row['duration']))
    # Bin edges every 5 minutes from 0 to 75 inclusive.
    b = [0, 5, 10, 15, 20, 25, 30, 35, 40, 45, 50, 55, 60, 65, 70, 75]
    plt.hist([subscriber_duration, customer_duration], bins=b, color=['orange', 'green'],
             label=['Subscriber', 'Customer'])
    title = "{} Distribution of Trip Durations".format(city)
    plt.title(title)
    plt.xlabel('Duration (m)')
    plt.legend()  # show the Subscriber/Customer labels
    plt.show()

data_file = ['./data/Washington-2016-Summary.csv', './data/Chicago-2016-Summary.csv', './data/NYC-2016-Summary.csv']
for datafile in data_file:
    # The function returns nothing, so call it without print().
    plot_duration_type(datafile)
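One possible refactoring, offered as a sketch rather than a drop-in replacement: keep the CSV filtering in its own function so it can be checked without opening a plot window. The name read_durations and the in-memory sample rows are assumptions made for illustration; the column names duration and user_type come from the data in the question.

```python
import csv
import io

def read_durations(f_in, limit=75):
    """Group trip durations below `limit` minutes by user type."""
    durations = {'Subscriber': [], 'Customer': []}
    for row in csv.DictReader(f_in):
        minutes = float(row['duration'])
        # Skip long trips and any user type we are not plotting.
        if minutes < limit and row['user_type'] in durations:
            durations[row['user_type']].append(minutes)
    return durations

# Tiny in-memory CSV standing in for a real summary file (made-up rows).
sample = io.StringIO(
    "duration,month,hour,day_of_week,user_type\n"
    "15.43,3,23,Thursday,Subscriber\n"
    "3.30,3,22,Thursday,Customer\n"
    "90.00,3,22,Thursday,Subscriber\n"
)
result = read_durations(sample)
print(result)
```

The two lists can then be handed to plt.hist as [result['Subscriber'], result['Customer']] together with the 5-minute bin edges.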
Upvotes: 0
Reputation: 23889
I cannot try it out but here are my thoughts:
The bins argument can also be a sequence of bin edges. Therefore you can take the minimum and maximum of the durations and create a sequence with a step size of 5 (here using the numpy library):
import numpy as np
sequence = np.arange(np.floor(dat['duration'].min()), np.ceil(dat['duration'].max()) + 5, 5)
(The minimum is floored to an integer, and the stop value is padded by one step because np.arange excludes it; without the padding the last edge can fall below the maximum.)
Here the code relies on the fact that I read the data using the pandas library. It can easily be filtered using pandas as well:
import pandas as pd
dat = pd.read_csv('YOURFILE.csv')
dat_filtered = dat[dat['duration'] < 75]
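Putting the filtering and the bin edges together might look like the sketch below. It is untested against the real files: the DataFrame is built from a few rows of the question's sample plus one made-up 90-minute trip, and np.histogram stands in for the plot so the counts can be inspected directly.

```python
import numpy as np
import pandas as pd

# A few rows copied from the question's sample, plus one made-up long trip.
dat = pd.DataFrame({
    'duration': [15.433333, 3.300000, 2.066667, 36.033333, 90.0],
    'user_type': ['Subscriber'] * 5,
})

# Keep only trips shorter than 75 minutes.
dat_filtered = dat[dat['duration'] < 75]

# Floor the minimum and pad the stop so the last edge covers the maximum.
start = np.floor(dat_filtered['duration'].min())
stop = np.ceil(dat_filtered['duration'].max()) + 5
sequence = np.arange(start, stop, 5)

# Count trips per bin; plt.hist(dat_filtered['duration'], bins=sequence)
# would draw the same bins.
counts, edges = np.histogram(dat_filtered['duration'], bins=sequence)
print(edges)
print(counts)
```

Because the edges start at the floored minimum, the bins here begin at 2 rather than 0; use np.arange(0, 80, 5) instead if the bins should line up with multiples of 5.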
Happy Holidays.
Upvotes: 1