Reputation: 13807
I have an automatic vehicle location (AVL) dataset in .csv format for the public transit system of a city. I would like to use this AVL dataset to build a GTFS dataset for the purpose of running accessibility analysis.
I've seen a solution for creating a GTFS dataset from GPS data stored in an SQL database (here), but I haven't found one for GPS data stored in .csv format, which is the case here. I would be glad to have any help on this, preferably with a solution in R or Python.
I already have the stops.txt file of the GTFS, but I guess I would need to create the files shapes.txt, trips.txt, routes.txt and stop_times.txt.
This is what my GPS.csv dataset looks like:
timestamp order line lat long speed route_name
1: 2016-02-24 00:04:56 B27084 905 -22.9 -43.3 32.00 12860326
2: 2016-02-24 00:05:07 B41878 2302 -22.9 -43.2 0.19 12860386
3: 2016-02-24 00:04:37 B75563 928 -22.9 -43.2 0.00 12867184
4: 2016-02-24 00:05:17 D86084 852 -23.0 -43.6 24.26 12860043
5: 2016-02-24 00:04:58 C41420 -22.9 -43.2 0.00 NA
6: 2016-02-24 00:04:47 C30084 -23.0 -43.3 0.00 NA
Upvotes: 5
Views: 716
Reputation: 616
There are five required files: agency.txt, routes.txt, trips.txt, stop_times.txt, and stops.txt. For a pseudo-GTFS that is only intended for the purposes of computing accessibility, a lot of the optional fields in the required files can be omitted, as well as all of the optional files. However, you might want to copy real ones or construct them, as they can be useful for this purpose (e.g. people will consider fares when choosing how to travel, so you could do with fare_attributes.txt and fare_rules.txt).
Read the specification carefully.
If it's acceptable to imagine that all routes are served by the same agency, yours could simply be:
agency_id,agency_name,agency_url,agency_timezone,agency_lang,agency_phone
XXX,My Awesome Agency,http://example.com,,,
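i.e. you can fill in just the first three fields. Note, however, that the specification also lists agency_timezone as required, so strict validators will expect a value there.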
agency.txt is intended to represent:
One or more transit agencies that provide the data in this feed.
For routes.txt, you need:
route_id (primary key)
route_short_name
route_long_name
route_type (must be in range 0–7; indicates mode)
Example:
route_id,agency_id,route_short_name,route_long_name,route_desc,route_type,route_url,route_color
12860326,XXX,12860326,12860326,,3,,
12860386,XXX,12860386,12860386,,3,,
12867184,XXX,12867184,12867184,,3,,
I don't know what to do with the records in your example data that do not have a route assigned to them. I also don't know what order refers to. Perhaps order is a name for the route? As long as you can come up with something that is the same concept as a "route" identifier, you can use that. For reference, a "route" is defined as:
A route is a group of trips that are displayed to riders as a single service.
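As a rough illustration of the routes.txt step, here is a minimal Python sketch using only the standard library. It assumes that the AVL column route_name is the right concept to use as route_id (which is not confirmed), that every route is a bus (route_type 3), and that records with NA or an empty route are simply skipped. The function names build_routes and routes_to_csv are made up for this example.

```python
import csv
import io


def build_routes(avl_rows, agency_id="XXX", route_type="3"):
    """Return one routes.txt row per distinct route identifier.

    Assumption: the AVL column `route_name` plays the role of route_id.
    """
    seen = {}
    for row in avl_rows:
        route = (row.get("route_name") or "").strip()
        if not route or route == "NA":
            continue  # no route assigned, so it cannot be mapped to a GTFS route
        seen.setdefault(route, {
            "route_id": route,
            "agency_id": agency_id,
            "route_short_name": route,
            "route_long_name": route,
            "route_type": route_type,  # 3 means bus in GTFS
        })
    return [seen[key] for key in sorted(seen)]


def routes_to_csv(routes):
    """Serialise the rows as routes.txt content."""
    fields = ["route_id", "agency_id", "route_short_name",
              "route_long_name", "route_type"]
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    writer.writerows(routes)
    return buf.getvalue()


# A few records shaped like the sample in the question.
sample = [
    {"order": "B27084", "line": "905", "route_name": "12860326"},
    {"order": "B41878", "line": "2302", "route_name": "12860386"},
    {"order": "C41420", "line": "", "route_name": "NA"},
    {"order": "B27084", "line": "905", "route_name": "12860326"},
]
routes = build_routes(sample)
routes_txt = routes_to_csv(routes)
print(routes_txt)
```

In practice the rows would come from csv.DictReader over GPS.csv rather than from a hard-coded list.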
A trip is a sequence of two or more stops that occurs at specific time.
For trips.txt, you need:
trip_id (primary key)
route_id (foreign key)
service_id (foreign key)
Example:
route_id,service_id,trip_id,trip_headsign,direction_id,block_id,shape_id
12860326,1,1,,1,,12860326
12860326,1,2,,1,,12860326
12860386,1,3,,1,,12860386
12860386,1,4,,0,,12860386
direction_id, while optional, tends to be pretty useful, and I've had several applications that ingest GTFS require it despite its optional status.
service_id is tricky, and works in conjunction with calendar dates. It allows the GTFS to easily represent, say, "normal" weekday service, and holiday services when holidays fall on weekdays. For your purposes, you can probably just use 1 for everything, but it depends on your application and when your AVL data has been collected. When I worked on a similar application, I maintained a lookup table in my database that told me whether a particular date was a public holiday, and/or a school holiday, and/or during the university semesters, because bus routes changed to suit students.
shape_id is optional but will be critical if you want to draw your routes on maps, or use tools like OpenTripPlanner.
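The kind of date lookup mentioned for service_id could be sketched roughly as follows. The scheme itself is an assumption for illustration only: weekdays map to service "1", while Saturdays, Sundays and any date in a caller-supplied holiday set map to service "2". The function name service_id_for is hypothetical.

```python
from datetime import date


def service_id_for(day, holidays=frozenset()):
    """Choose a service_id for one calendar date.

    Hypothetical scheme: "1" for ordinary weekdays, "2" for weekends
    and for any date listed in `holidays`.
    """
    if day in holidays:
        return "2"
    # date.weekday() returns 0 for Monday through 6 for Sunday.
    return "1" if day.weekday() < 5 else "2"


wednesday = service_id_for(date(2016, 2, 24))
saturday = service_id_for(date(2016, 2, 27))
holiday = service_id_for(date(2016, 2, 24), holidays={date(2016, 2, 24)})
print(wednesday, saturday, holiday)
```

A real lookup table would likely carry more categories (school holidays, university semesters), each with its own service_id.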
Times that a vehicle arrives at and departs from individual stops for each trip.
For stop_times.txt, you will need:
stop_id (foreign key)
trip_id (foreign key)
arrival_time
departure_time
stop_sequence
This will require the most work when scripting. It will be several orders of magnitude larger than all of the other files combined.
stop_id and trip_id happily relate to the stops and trips as already identified. The departure_time and arrival_time will be in two rows of the AVL data, and in many cases actually identifying when a service arrived at a stop is the most difficult aspect of this task. It's easier with access to passenger smartcard data, and when a service actually stops you're likely to find spatial clusters of AVL records, as the vehicle would not have moved for a particular period of time. However, if a stop is empty and no one wants to get off, it will be hard to determine when a service actually "arrived" at the stop, particularly because the behaviour of a driver can sometimes change if they do not intend to make a stop when one is scheduled (e.g. travelling faster or taking a shortcut if they see no one waiting). In your case, the speed value is likely to be helpful, but be careful not to confuse a passenger stop with an intersection.
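One possible way to turn the clustering idea into code is sketched below. It is only a heuristic under stated assumptions: the pings belong to a single vehicle and are already sorted by time, the stop coordinates are far more precise than the one-decimal values shown in the question, and the thresholds (max_speed in whatever unit the AVL speed column uses, max_dist_m in metres) are arbitrary values that would need tuning. The stop coordinates in the sample are invented.

```python
import math
from datetime import datetime


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    radius = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(a))


def nearest_stop(lat, lon, stops, max_dist_m):
    """Return the id of the closest stop within max_dist_m, or None."""
    best_id, best_dist = None, max_dist_m
    for stop_id, (slat, slon) in stops.items():
        dist = haversine_m(lat, lon, slat, slon)
        if dist <= best_dist:
            best_id, best_dist = stop_id, dist
    return best_id


def detect_stop_events(pings, stops, max_speed=5.0, max_dist_m=30.0):
    """Group consecutive slow pings near the same stop into one event."""
    events = []
    current = None
    for ts, lat, lon, speed in pings:
        stop_id = nearest_stop(lat, lon, stops, max_dist_m) if speed <= max_speed else None
        if stop_id is None:
            current = None  # vehicle is moving or away from any stop
            continue
        if current is not None and current["stop_id"] == stop_id:
            current["departure"] = ts  # still dwelling at the same stop
        else:
            current = {"stop_id": stop_id, "arrival": ts, "departure": ts}
            events.append(current)
    return events


# Invented stop coordinates and pings, for illustration only.
stops = {"22018": (-22.9000, -43.2000), "22557": (-22.9050, -43.2000)}
pings = [
    (datetime(2016, 2, 24, 0, 5, 7), -22.90001, -43.2000, 0.0),
    (datetime(2016, 2, 24, 0, 5, 54), -22.90001, -43.2000, 0.0),
    (datetime(2016, 2, 24, 0, 6, 30), -22.9025, -43.2000, 30.0),
    (datetime(2016, 2, 24, 0, 7, 16), -22.9050, -43.2000, 2.0),
]
events = detect_stop_events(pings, stops)
for event in events:
    print(event)
```

This does nothing to separate a passenger stop from a nearby intersection; a stop located next to traffic lights will still produce false positives.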
stop_sequence is required by the specification, and applications expect it to exist. In any case, if your script can't identify stop_sequence then you probably can't correctly construct this file.
Example:
trip_id,arrival_time,departure_time,stop_id,stop_sequence,stop_headsign,pickup_type,drop_off_type,shape_dist_traveled
1,00:05:07,00:05:54,22018,1,,0,0,0
1,00:07:16,00:08:01,22557,2,,0,0,39
1,00:10:56,00:10:56,22559,3,,0,0,76
Indicating dwelling time is optional, so if this is too hard to work out, arrival_time and departure_time can validly be the same moment.
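As a small sketch of how detected stop events might become stop_times.txt rows, the code below orders the events of one trip by arrival, numbers them for stop_sequence, and falls back to the arrival time when no departure is known. The event dictionaries and the function names gtfs_time and stop_time_rows are assumptions made for this example. Times are counted from midnight of the service day, so a trip running past midnight gets hours above 23, which GTFS permits.

```python
from datetime import date, datetime, time


def gtfs_time(ts, service_day):
    """Format a timestamp as HH:MM:SS counted from midnight of the service day."""
    midnight = datetime.combine(service_day, time.min)
    seconds = int((ts - midnight).total_seconds())
    hours, rem = divmod(seconds, 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def stop_time_rows(trip_id, events, service_day):
    """Build stop_times.txt rows for one trip from its stop events."""
    rows = []
    ordered = sorted(events, key=lambda event: event["arrival"])
    for sequence, event in enumerate(ordered, start=1):
        # Without a known dwell time, reuse the arrival as the departure.
        departure = event.get("departure") or event["arrival"]
        rows.append({
            "trip_id": trip_id,
            "arrival_time": gtfs_time(event["arrival"], service_day),
            "departure_time": gtfs_time(departure, service_day),
            "stop_id": event["stop_id"],
            "stop_sequence": sequence,
        })
    return rows


service_day = date(2016, 2, 24)
trip_events = [
    {"stop_id": "22557", "arrival": datetime(2016, 2, 24, 0, 7, 16)},
    {"stop_id": "22018", "arrival": datetime(2016, 2, 24, 0, 5, 7),
     "departure": datetime(2016, 2, 24, 0, 5, 54)},
]
rows = stop_time_rows("1", trip_events, service_day)
for row in rows:
    print(row)
```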
In practice, pickup_type and drop_off_type are very influential, but generally impossible to determine from AVL data alone, unless your AVL collector has really thought about supporting GTFS in their archival... which is unfortunately very unlikely. You will probably just have to allow both always, unless you have additional information that you can insert (e.g. "all trips on route 1 after stop 4 in weekday evenings only let passengers off").
For stops.txt, you need:
stop_id (primary key)
stop_name
stop_lon
stop_lat
You said that you have this already, which is great. The challenge is really in getting this to interface with stop_times via the stop_id foreign key. The AVL data I have worked with fortunately identified when services were stopped, and which stop they were stopped at, using the same code as in the GTFS representation of the schedule.
To get good results with tools like OpenTripPlanner, you will probably need to include a calendar.txt file. This also helps to identify the period of validity for your pseudo-GTFS, if you're taking the approach of modelling a defined period of time. For example:
service_id,monday,tuesday,wednesday,thursday,friday,saturday,sunday,start_date,end_date
1,1,1,1,1,1,0,0,20160224,20160226
2,0,0,0,0,0,1,1,20160224,20160226
3,0,0,0,0,0,1,0,20160224,20160226
This indicates that the modelled period is from 2016-02-24 to 2016-02-26 for those services. Any route requested outside of that range has an undefined travel time. I recommend that you pick a period of no more than a week: more than that and applications consuming the GTFS will start to struggle with the volume of data. Real GTFS data benefits from redundancy that this "pseudo" data cannot.
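The validity period can be read straight off the AVL timestamps. The sketch below assumes timestamps in the "YYYY-MM-DD HH:MM:SS" form shown in the question and emits two hypothetical services, "1" for Monday to Friday and "2" for the weekend; the function name calendar_rows is invented for this example.

```python
from datetime import datetime

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def calendar_rows(timestamps):
    """Build calendar.txt rows covering the span of the AVL timestamps."""
    days = [datetime.strptime(ts, "%Y-%m-%d %H:%M:%S").date() for ts in timestamps]
    start = min(days).strftime("%Y%m%d")
    end = max(days).strftime("%Y%m%d")

    def row(service_id, active_days):
        # active_days holds weekday indices, 0 for Monday through 6 for Sunday.
        record = {"service_id": service_id}
        for index, name in enumerate(WEEKDAYS):
            record[name] = int(index in active_days)
        record["start_date"] = start
        record["end_date"] = end
        return record

    return [row("1", {0, 1, 2, 3, 4}), row("2", {5, 6})]


calendar = calendar_rows([
    "2016-02-24 00:04:56",
    "2016-02-25 13:20:00",
    "2016-02-26 23:59:10",
])
for record in calendar:
    print(record)
```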
For shapes.txt, don't worry about shape_dist_traveled; I just use dummy information for that (monotonically increasing). It can be inferred from the shape, unless the shape is too generalised.
Example:
shape_id,shape_pt_lat,shape_pt_lon,shape_pt_sequence,shape_dist_traveled
12860386,-22.9,-43.3,1,1
12860386,-22.0,-42.9,2,2
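If you would rather compute shape_dist_traveled than use dummy values, a sketch along these lines may help. It assumes you already have the ordered vehicle positions of one representative trip, with far more decimal places than in the sample above, and it accumulates great-circle distance in metres between consecutive points. The function names and the sample coordinates are invented for this example.

```python
import math


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    radius = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(a))


def shape_rows(shape_id, points):
    """Build shapes.txt rows from an ordered list of (lat, lon) points."""
    rows = []
    travelled = 0.0
    previous = None
    for sequence, (lat, lon) in enumerate(points, start=1):
        if previous is not None:
            travelled += haversine_m(previous[0], previous[1], lat, lon)
        rows.append({
            "shape_id": shape_id,
            "shape_pt_lat": lat,
            "shape_pt_lon": lon,
            "shape_pt_sequence": sequence,
            "shape_dist_traveled": round(travelled, 1),
        })
        previous = (lat, lon)
    return rows


# Invented positions, one thousandth of a degree apart.
shape = shape_rows("12860386", [
    (-22.9000, -43.2000),
    (-22.9010, -43.2000),
    (-22.9010, -43.2010),
])
for row in shape:
    print(row)
```

Because the distance only ever grows, the result is monotonically increasing, as consumers of the feed expect.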
The general idea is to use the AVL data at hand to fulfil the minimum requirements of a specification-meeting transit feed. You will probably need to write your own scripts to create these files, because there is no standard for AVL data. You can make some information up, and you will probably need to: most applications will raise exceptions when you try to use an incomplete feed. Indeed, in my experience, quite a few applications will actually have problems with feeds that meet only the minimum requirements, because the software is poorly written and most real-world data goes a bit beyond the minimum standard.
You will probably find deficiencies in your AVL data that make it hard to use. The most notable case of this is routes that did run, but the AVL did not work. In such a case, your pseudo-GTFS will not accurately represent the transit system in practice. These are nearly impossible to detect.
In this case, I don't understand the differences between your order, line, and route_name fields. You will need to determine where these best fit; I've ignored them because I don't understand what they represent. You need to match the AVL schema to the concepts of the GTFS.
Transit systems tend to be very complicated with lots of little exceptions. You might end up excluding some particularly aberrant cases.
Your latitude and longitude values do not look very precise: if that is real data, you probably will not be able to use shapes.txt. Try asking for more precision in the vehicle positions.
Upvotes: 3