rafa.pereira

Reputation: 13807

Create a pseudo GTFS dataset from AVL (GPS) data in .CSV format

I have an automatic vehicle location (AVL) dataset in .csv format of the public transit system of a city. I would like to use this AVL dataset to build a GTFS dataset for the purpose of running accessibility analysis.

I've seen a solution for creating a GTFS dataset from GPS data stored in an SQL database, but I haven't found one for GPS data stored in .csv format, which is the case here. I would be glad to have any help on this, preferably with a solution in either R or Python.

I already have the stops.txt file of the GTFS, but I guess I would need to create the files shapes.txt, trips.txt, routes.txt and stop_times.txt.

This is what my GPS.csv dataset looks like:

             timestamp  order  line      lat      long      speed route_name
1: 2016-02-24 00:04:56 B27084   905    -22.9     -43.3      32.00   12860326
2: 2016-02-24 00:05:07 B41878  2302    -22.9     -43.2       0.19   12860386
3: 2016-02-24 00:04:37 B75563   928    -22.9     -43.2       0.00   12867184
4: 2016-02-24 00:05:17 D86084   852    -23.0     -43.6      24.26   12860043
5: 2016-02-24 00:04:58 C41420          -22.9     -43.2       0.00         NA
6: 2016-02-24 00:04:47 C30084          -23.0     -43.3       0.00         NA

Upvotes: 5

Views: 716

Answers (1)

alphabetasoup

Reputation: 616

There are five required files: agency.txt, routes.txt, trips.txt, stop_times.txt, and stops.txt (plus calendar.txt or calendar_dates.txt to define the service_id values that trips.txt refers to). For a pseudo-GTFS that is only intended for computing accessibility, many of the optional fields in the required files can be omitted, as can all of the optional files. However, you might want to copy real ones or construct them where they are useful for this purpose (e.g. people consider fares when choosing how to travel, so you could add fare_attributes.txt and fare_rules.txt).

Read the specification carefully.

agency

If it's acceptable to imagine that all routes are served by the same agency, yours could simply be:

agency_id,agency_name,agency_url,agency_timezone,agency_lang,agency_phone
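XXX,My Awesome Agency,http://example.com,America/Sao_Paulo,,

i.e. you only need agency_name, agency_url and agency_timezone. agency_id is optional when there is a single agency, but including it keeps the foreign key in routes.txt explicit. The timezone shown is a guess based on your coordinates; use whichever applies to your city.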

agency.txt is intended to represent:

One or more transit agencies that provide the data in this feed.

routes

You need:

  • route_id (primary key)
  • route_short_name
  • route_long_name
  • route_type (must be in range 0–7; indicates mode)

Example:

route_id,agency_id,route_short_name,route_long_name,route_desc,route_type,route_url,route_color
12860326,XXX,12860326,12860326,,3,,
12860386,XXX,12860386,12860386,,3,,
12867184,XXX,12867184,12867184,,3,,

I don't know what to do with the routes that do not have a route assigned to them in your example data. I also don't know what order refers to. Perhaps order is a name for the route? As long as you can come up with something that is the same concept as a "route" identifier, you can use that. For reference, a "route" is defined as:

A route is a group of trips that are displayed to riders as a single service.
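As a rough sketch of this step in Python (standard library only), the following collects the distinct route_name values from the AVL file and writes one routes.txt row per value. It assumes that route_name really is the route identifier (which I could not confirm), that every route is a bus (route_type 3), that GPS.csv is comma-separated with the column names shown in the question, and it reuses the placeholder agency XXX. Rows with no route assigned (empty or NA) are skipped. The function name build_routes and the inline sample are made up for illustration.

```python
import csv
import io

def build_routes(avl_rows, agency_id="XXX", route_type="3"):
    """Return routes.txt rows, one per distinct route_name in the AVL records."""
    route_ids = sorted({
        row["route_name"] for row in avl_rows
        if row["route_name"] not in ("", "NA")  # skip pings with no route assigned
    })
    return [
        {"route_id": rid, "agency_id": agency_id,
         "route_short_name": rid, "route_long_name": rid,
         "route_type": route_type}
        for rid in route_ids
    ]

# Tiny inline sample standing in for GPS.csv (assumed comma-separated).
sample = io.StringIO(
    "timestamp,order,line,lat,long,speed,route_name\n"
    "2016-02-24 00:04:56,B27084,905,-22.9,-43.3,32.00,12860326\n"
    "2016-02-24 00:05:07,B41878,2302,-22.9,-43.2,0.19,12860386\n"
    "2016-02-24 00:04:58,C41420,,-22.9,-43.2,0.00,NA\n"
)
routes = build_routes(csv.DictReader(sample))

out = io.StringIO()  # swap for open("routes.txt", "w", newline="") to write a file
writer = csv.DictWriter(out, fieldnames=list(routes[0]))
writer.writeheader()
writer.writerows(routes)
print(out.getvalue())
```

If line or order turns out to be the better match for a "route", only the column name inside build_routes needs to change.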

trips

A trip is a sequence of two or more stops that occurs at a specific time.

You need:

  • trip_id (primary key)
  • route_id (foreign key)
  • service_id (foreign key)

Example:

route_id,service_id,trip_id,trip_headsign,direction_id,block_id,shape_id
12860326,1,1,,1,,12860326
12860326,1,2,,1,,12860326
12860386,1,3,,1,,12860386
12860386,1,4,,0,,12860386

direction_id, while optional, tends to be pretty useful, and I've had several applications that ingest GTFS require it despite its optional status.

service_id is tricky, and works in conjunction with calendar dates. It allows the GTFS to easily represent, say, "normal" weekday service, and holiday services when holidays fall on weekdays. For your purposes, you can probably just use 1 for everything, but it depends on your application and when your AVL data has been collected. When I worked on a similar application, I maintained a lookup table in my database that told me whether a particular date was a public holiday, and/or a school holiday, and/or during the university semesters, because bus routes changed to suit students.

shape_id is optional but will be critical if you want to draw your routes on maps, or use tools like OpenTripPlanner.
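Your AVL sample has no trip identifier, so trips have to be inferred somehow. One possible heuristic, and it is only my assumption about what might work, is to take the pings of a single vehicle in time order and start a new trip whenever the route changes or the vehicle goes silent for a while. The 10-minute gap, the function name and the sample pings below are all invented for illustration; you would also want to split on direction changes, which this sketch does not attempt.

```python
from datetime import datetime, timedelta

def split_into_trips(pings, max_gap=timedelta(minutes=10)):
    """Group one vehicle's pings into candidate trips.

    pings: list of (timestamp, route_name) tuples for a single vehicle.
    A new trip starts when the route changes or the time gap exceeds max_gap.
    """
    trips = []
    for ts, route in sorted(pings):
        last = trips[-1] if trips else None
        if (last is None or last["route_id"] != route
                or ts - last["pings"][-1] > max_gap):
            trips.append({"route_id": route, "pings": []})
        trips[-1]["pings"].append(ts)
    return trips

start = datetime(2016, 2, 24, 0, 4, 56)
pings = [
    (start, "12860326"),
    (start + timedelta(minutes=1), "12860326"),
    (start + timedelta(minutes=30), "12860326"),  # long gap: new trip
    (start + timedelta(minutes=31), "12860386"),  # route change: new trip
]
trips = split_into_trips(pings)
# Number the trips across the whole feed so that trip_id stays unique.
for trip_id, trip in enumerate(trips, start=1):
    print(trip_id, trip["route_id"], len(trip["pings"]))
```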

stop_times

Times that a vehicle arrives at and departs from individual stops for each trip.

You will need:

  • stop_id (foreign key)
  • trip_id (foreign key)
  • arrival_time
  • departure_time
  • stop_sequence

This will require the most work when scripting. It will be several orders of magnitude larger than all of the other files combined.

stop_id and trip_id happily relate to the stops and trips as already identified. The arrival_time and departure_time will generally come from two different rows of the AVL data, and in many cases identifying when a service actually arrived at a stop is the most difficult aspect of this task. It's easier with access to passenger smartcard data. When a service does stop, you're likely to find spatial clusters of AVL records, because the vehicle will not have moved for a period of time. However, if a stop is empty and no one wants to get off, it will be hard to determine when a service "arrived" at the stop, particularly because the behaviour of a driver can change if they do not intend to make a scheduled stop (e.g. travelling faster or taking a shortcut if they see no one waiting). In your case, the speed value is likely to be helpful, but be careful not to confuse a passenger stop with an intersection.

stop_sequence is required by the specification, and applications expect it to exist. In any case, if your script can't identify stop_sequence then you probably can't correctly construct this file.
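A minimal sketch of one way to approach this, assuming you have already grouped the pings of one trip: treat a stop as "visited" when any ping falls within some radius of it, take the first such ping as the arrival and the last as the departure, and number the stops in order of arrival. The 30 m radius, the function names and the stop coordinates are assumptions of mine, not anything from your data. It does not use speed, does not distinguish stops from intersections, does not handle routes that pass the same stop twice, and formats times with strftime, so trips running past midnight would need extra handling.

```python
import math
from datetime import datetime, timedelta

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    r = 6371000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def stop_times_for_trip(trip_id, pings, stops, radius_m=30.0):
    """pings: list of (datetime, lat, lon) for one trip; stops: {stop_id: (lat, lon)}."""
    visits = []
    for stop_id, (slat, slon) in stops.items():
        near = [ts for ts, lat, lon in pings
                if haversine_m(lat, lon, slat, slon) <= radius_m]
        if near:
            # First and last ping inside the radius; identical if only one ping.
            visits.append((min(near), max(near), stop_id))
    visits.sort()  # order stops by arrival time
    return [
        {"trip_id": trip_id,
         "arrival_time": arr.strftime("%H:%M:%S"),
         "departure_time": dep.strftime("%H:%M:%S"),
         "stop_id": stop_id,
         "stop_sequence": seq}
        for seq, (arr, dep, stop_id) in enumerate(visits, start=1)
    ]

# Hypothetical stop positions and pings, for illustration only.
stops = {"22018": (-22.9000, -43.2000), "22557": (-22.9000, -43.1900)}
t0 = datetime(2016, 2, 24, 0, 5, 7)
pings = [
    (t0, -22.90000, -43.20000),
    (t0 + timedelta(seconds=20), -22.90001, -43.20001),
    (t0 + timedelta(seconds=60), -22.90000, -43.19500),
    (t0 + timedelta(seconds=120), -22.90000, -43.19000),
]
rows = stop_times_for_trip("1", pings, stops)
for row in rows:
    print(row)
```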

Example:

trip_id,arrival_time,departure_time,stop_id,stop_sequence,stop_headsign,pickup_type,drop_off_type,shape_dist_traveled
1,00:05:07,00:05:54,22018,1,,0,0,0
1,00:07:16,00:08:01,22557,2,,0,0,39
1,00:10:56,00:10:56,22559,3,,0,0,76

Indicating dwell time is optional, so if this is too hard to work out, arrival_time and departure_time can validly be the same moment.

In practice, pickup_type and drop_off_type are very influential, but generally impossible to determine from AVL data alone, unless your AVL collector has really thought about supporting GTFS in their archival, which is unfortunately very unlikely. You will probably just have to allow both always (value 0, regular pickup and drop-off), unless you have additional information that you can insert (e.g. "all trips on route 1 after stop 4 in weekday evenings only let passengers off").

stops

  • stop_id (primary key)
  • stop_name
  • stop_lon
  • stop_lat

You said that you have this already, which is great. The challenge is really in getting this to interface with stop_times via the stop_id foreign key. The AVL data I have worked with fortunately identified when services were stopped, and at which stop, using the same code as in the GTFS representation of the schedule.

calendar

To get good results with tools like OpenTripPlanner, you will probably need to include a calendar.txt file. This also helps to identify the period of validity for your pseudo-GTFS, if you're taking the approach of modelling a defined period of time. For example:

service_id,monday,tuesday,wednesday,thursday,friday,saturday,sunday,start_date,end_date
1,1,1,1,1,1,0,0,20160224,20160226
2,0,0,0,0,0,1,1,20160224,20160226
3,0,0,0,0,0,1,0,20160224,20160226

This indicates that the modelled period is from 2016-02-24 to 2016-02-26 for those services. Any route requested outside of that range has an undefined travel time. I recommend that you pick a period of no more than a week: more than that and applications consuming the GTFS will start to struggle with the volume of data. Real GTFS data benefits from redundancy that this "pseudo" data cannot.
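If you would rather derive the calendar from the AVL data than type it by hand, something like the following could work. It is a sketch under the assumption that a single service_id covers every date on which you observed pings; it flags only the weekdays that actually occur in those dates and uses the earliest and latest date as the validity period. The function name calendar_row is made up.

```python
from datetime import date

DAYS = ["monday", "tuesday", "wednesday", "thursday",
        "friday", "saturday", "sunday"]

def calendar_row(service_id, service_dates):
    """Build one calendar.txt row covering the given datetime.date values."""
    active = {d.weekday() for d in service_dates}  # Monday is 0
    row = {"service_id": service_id}
    for index, name in enumerate(DAYS):
        row[name] = 1 if index in active else 0
    row["start_date"] = min(service_dates).strftime("%Y%m%d")
    row["end_date"] = max(service_dates).strftime("%Y%m%d")
    return row

# Dates on which pings were observed (2016-02-24 is a Wednesday).
observed = [date(2016, 2, 24), date(2016, 2, 25), date(2016, 2, 26)]
row = calendar_row("1", observed)
print(",".join(str(value) for value in row.values()))
```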

shapes

Don't worry about shape_dist_traveled; I just use dummy information for that (monotonically increasing). It can be inferred from the shape, unless the shape is too generalised.

Example:

shape_id,shape_pt_lat,shape_pt_lon,shape_pt_sequence,shape_dist_traveled
12860386,-22.9,-43.3,1,1
12860386,-22.0,-42.9,2,2
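If you do want real distances instead of dummy values, a sketch like this accumulates the great-circle distance along the ordered points of one shape. It assumes you have already picked a representative, time-ordered sequence of positions for the route (for example the pings of one clean trip); the function names and the sample points are invented, and the distances are in metres, which is my choice of unit.

```python
import math

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two WGS84 points."""
    r = 6371000.0  # mean Earth radius in metres
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def shape_rows(shape_id, points):
    """points: ordered list of (lat, lon). Returns shapes.txt rows as dicts."""
    rows, dist, prev = [], 0.0, None
    for seq, (lat, lon) in enumerate(points, start=1):
        if prev is not None:
            dist += haversine_m(prev[0], prev[1], lat, lon)
        rows.append({"shape_id": shape_id,
                     "shape_pt_lat": lat,
                     "shape_pt_lon": lon,
                     "shape_pt_sequence": seq,
                     "shape_dist_traveled": round(dist, 1)})
        prev = (lat, lon)
    return rows

# Hypothetical points, more precise than the one-decimal sample above.
shape = shape_rows("12860386", [(-22.9000, -43.2000),
                                (-22.9010, -43.2000),
                                (-22.9020, -43.2000)])
for row in shape:
    print(row)
```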

Note

The general idea is to use the AVL data at hand to fulfil the minimum requirements of a specification-meeting transit feed. You will probably need to write your own scripts to create these files, because there is no standard for AVL data. You can make some information up, and you will probably need to: most applications will raise exceptions when you try to use an incomplete feed. Indeed, in my experience, quite a few applications will have problems with feeds that meet only the minimum requirements, because the programs are poorly written and most real-world data goes a bit beyond the minimum standard.
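Since broken references between files are one of the easier ways to end up with a feed that applications reject, it may be worth running a small consistency check before loading the feed anywhere. The helper below is a hypothetical sketch, not part of any GTFS tool: it lists key values that appear in a child table (e.g. route_id in trips.txt) but not in the parent table (routes.txt).

```python
def missing_references(child_rows, parent_rows, key):
    """Return sorted key values used in child_rows but absent from parent_rows."""
    parent_keys = {row[key] for row in parent_rows}
    return sorted({row[key] for row in child_rows} - parent_keys)

# Rows as they would come out of csv.DictReader; values are illustrative.
routes = [{"route_id": "12860326"}, {"route_id": "12860386"}]
trips = [
    {"trip_id": "1", "route_id": "12860326"},
    {"trip_id": "2", "route_id": "99999999"},  # no such route
]
missing = missing_references(trips, routes, "route_id")
print(missing)
```

The same call works for trip_id between stop_times.txt and trips.txt, or stop_id between stop_times.txt and stops.txt.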

You will probably find deficiencies in your AVL data that make it hard to use. The most notable case of this is routes that did run, but the AVL did not work. In such a case, your pseudo-GTFS will not accurately represent the transit system in practice. These are nearly impossible to detect.

In this case, I don't understand the differences between your order, line, and route_name fields. You will need to determine where these best fit; apart from using route_name as a placeholder route_id in the examples, I've ignored them because I don't understand what they represent. You need to match the AVL schema to the concepts of the GTFS.

Transit systems tend to be very complicated with lots of little exceptions. You might end up excluding some particularly aberrant cases.

Your latitude and longitude values do not look very precise: if that is real data, you probably will not be able to use shapes.txt. Try asking for more precision in the vehicle positions.
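To put a rough number on that: one degree of latitude is about 111 km, so a coordinate stored to one decimal place can only resolve positions to around 11 km. The sketch below works this out for a given number of decimal places, assuming a spherical Earth of radius 6,371 km; the function name is made up, and the latitude is taken from your sample.

```python
import math

METRES_PER_DEGREE = 6371000.0 * math.pi / 180  # about 111 km per degree of latitude

def resolution_m(decimals, lat_deg):
    """Approximate ground size of one unit in the last decimal place.

    Returns (north_south_m, east_west_m) at the given latitude.
    """
    step = 10 ** -decimals
    north_south = step * METRES_PER_DEGREE
    east_west = north_south * math.cos(math.radians(lat_deg))
    return north_south, east_west

for decimals in (1, 3, 5):
    ns, ew = resolution_m(decimals, -22.9)
    print(f"{decimals} decimals: ~{ns:.1f} m north-south, ~{ew:.1f} m east-west")
```

Five decimal places come out at roughly a metre, which is about what you would need to tell neighbouring stops apart.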

Upvotes: 3
