Transit Data Quirks: Optimizing GTFS Shape Files

We use GTFS (see the specification) files on a daily basis in our line of work. Our transit data visualization and analytics platform VIZ (viz.v-a.io) combines GTFS schedule data and realtime data, and our TEMPO anti-bunching system uses schedules based on GTFS. This means we get to see a lot of the quirks of individual agencies’ GTFS data. Here, I’ll discuss one in particular: transit agency shape files tend to have far more points than are necessary to accurately represent the route shapes.

In some sense, that’s admirable -- the transit agency is trying to provide the public with data that’s as accurate as possible. However, in some cases this extra data is completely useless, as when a point lies exactly on the straight line joining its two neighbors. In other cases, such as curved vehicle paths, the data provided can be overkill: sub-meter resolution for transit vehicles is in most cases unnecessary.

Why is this extra data annoying? One reason might be the wasted space storing the data, but that’s actually pretty minor -- the size of GTFS data feeds is usually dominated by stop_times.txt, and storage space is cheap. More important are the data and time costs of transmission and visualization. The difference between transmitting 100 kB and 400 kB of shape data adds up quickly if you do it often enough. And drawing those extra points on our D3 web frontend can slow down the end user experience.

How to deal with this?

We apply the Ramer-Douglas-Peucker algorithm (RDP) to determine which points are unnecessary. At a high level, RDP takes a stretch of the shape, draws a straight line between its two end points, and finds the intermediate point that lies farthest from that line. If that distance is below a certain threshold (we conventionally use 2 meters), every intermediate point is identified as unnecessary. Otherwise the farthest point is kept, and the algorithm is applied recursively to the two pieces on either side of it, on smaller and smaller scales, until all unnecessary points have been identified. In practice, this commonly reduces the number of points by 20-95%; see the table at the end of this post for comparative statistics across agencies.

Original GTFS shape detail

Processed GTFS shape detail

The figure shows a roundabout on the historic 12-St. Charles streetcar line. The rail line has a fairly sharp curvature, so it demonstrates the limits of the approach. You can see that the cleaned version (on the right) has a slightly more jagged appearance -- that’s the downside. The upside is that the cleaned shape has 91 points instead of 380, a savings of 76%.

| Agency | GTFS Feed Date | Points Before | Points After | Reduction |
|---|---|---|---|---|
| MARTA (Atlanta) | 12/14/14 | 300,104 | 29,904 | 90.0% |
| Los Angeles Metro | 03/04/15 | 715,009 | 125,682 | 82.4% |
| AC Transit (Oakland) | 03/15/15 | 290,818 | 59,952 | 79.4% |
| TriMet (Portland) | 04/09/15 | 1,199,217 | 248,581 | 79.3% |
| SEPTA (Philadelphia) | 12/31/14 | 592,922 | 132,832 | 77.6% |
| HART (Tampa) | 12/07/14 | 41,884 | 11,580 | 72.4% |
| MBTA (Boston) | 12/16/14 | 378,738 | 119,186 | 68.5% |
| RTA New Orleans | 01/04/15 | 45,018 | 14,966 | 66.8% |
| METRO Houston | 12/16/14 | 278,081 | 101,370 | 63.5% |
| VIA (San Antonio) | 03/04/15 | 381,341 | 140,454 | 63.2% |
| MUNI (San Francisco) | 12/20/14 | 70,072 | 27,530 | 60.7% |
| TheBus (Honolulu) | 03/20/15 | 305,835 | 121,285 | 60.3% |
| NCTD (North San Diego County) | 12/30/14 | 49,830 | 27,796 | 44.2% |
| OmniTrans (San Bernardino) | 02/27/15 | 4,693 | 3,231 | 31.2% |
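For anyone who wants to produce similar numbers for their own feed, here is a rough Python sketch of the bookkeeping. The column names (`shape_id`, `shape_pt_lat`, `shape_pt_lon`, `shape_pt_sequence`) come from the GTFS specification for shapes.txt. The function names are hypothetical, `simplify` stands in for whatever point-reduction routine you use, and summing points over every shape in the feed is an assumption about how to count, not a description of our exact tooling.

```python
import csv
import io
from collections import defaultdict


def load_shapes(shapes_txt):
    """Parse shapes.txt content into {shape_id: [(lat, lon), ...]}.

    Points are ordered by shape_pt_sequence, as the GTFS spec requires.
    """
    rows = defaultdict(list)
    for row in csv.DictReader(io.StringIO(shapes_txt)):
        rows[row["shape_id"]].append(
            (
                int(row["shape_pt_sequence"]),
                float(row["shape_pt_lat"]),
                float(row["shape_pt_lon"]),
            )
        )
    return {
        shape_id: [(lat, lon) for _, lat, lon in sorted(points)]
        for shape_id, points in rows.items()
    }


def reduction_percent(points_before, points_after):
    """Percentage of points removed, as in the Reduction column."""
    return 100.0 * (points_before - points_after) / points_before


def feed_reduction(shapes, simplify):
    """Count points before and after applying simplify() to every shape.

    simplify is any callable taking and returning a list of (lat, lon).
    """
    before = sum(len(points) for points in shapes.values())
    after = sum(len(simplify(points)) for points in shapes.values())
    return before, after, reduction_percent(before, after)
```

As a sanity check against the streetcar example, `reduction_percent(380, 91)` comes out to roughly 76, and `reduction_percent(300104, 29904)` to roughly 90, matching the MARTA row.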