Notes on ITMS data

Hello, posting here some derived datasets I got after processing the ITMS data downloaded from the Google Drive folder.

Using simple drop_duplicates()-like functions in Python pandas, here are:

  • All instances of bus names with stop_ids, by date.
  • All instances of bus names with route_ids, by date.
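For anyone who wants to reproduce this kind of table, here is a minimal sketch of the de-duplication step. The column names (`timestamp`, `bus_name`, `route_id`, `stop_id`) and the sample rows are assumptions for illustration only; the actual ITMS export may name things differently.

```python
import pandas as pd

# Toy rows standing in for the raw ITMS feed; all values are invented and
# the column names are assumptions, not the real export schema.
raw = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2021-11-01 08:00:00", "2021-11-01 08:00:30",
        "2021-11-01 08:05:00", "2021-11-02 09:00:00",
    ]),
    "bus_name": ["B1", "B1", "B1", "B2"],
    "route_id": ["R10", "R10", "R10", "R20"],
    "stop_id": ["S1", "S1", "S2", "S9"],
})

# Reduce each timestamp to its calendar date so duplicates collapse per day.
raw["date"] = raw["timestamp"].dt.date

# One row per (date, bus, stop) combination that appears at least once.
bus_stops = (raw[["date", "bus_name", "stop_id"]]
             .drop_duplicates()
             .reset_index(drop=True))

# One row per (date, bus, route) combination that appears at least once.
bus_routes = (raw[["date", "bus_name", "route_id"]]
              .drop_duplicates()
              .reset_index(drop=True))

print(bus_stops)
print(bus_routes)
```

Selecting only the columns of interest before calling `drop_duplicates()` is what turns many pings into one row per combination.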

How this might help (maybe!):

  • See how frequently (or infrequently) each bus has been tagged with routes or stops by the ITMS system.
  • Find which buses have the most or fewest routes / stops in a day’s journey.
  • Cross-reference with static GTFS data to ascertain which buses are following which routes (that’ll probably need more work, but this is a good starting step).
  • Know the available lists of buses / routes to query.
  • Find whether certain buses went into or out of operation from certain dates.
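A couple of the ideas above can be sketched with a `groupby` on the derived table. This is only an illustration: the table below is a toy stand-in with invented values, and the column names (`date`, `bus_name`, `route_id`) are assumed rather than taken from the shared files.

```python
import pandas as pd

# Toy stand-in for the derived (date, bus_name, route_id) table; values are invented.
bus_routes = pd.DataFrame({
    "date": pd.to_datetime(["2021-11-01", "2021-11-01", "2021-11-02", "2021-11-03"]),
    "bus_name": ["B1", "B1", "B1", "B2"],
    "route_id": ["R10", "R11", "R10", "R20"],
})

# Number of distinct routes each bus was tagged with on each day.
routes_per_day = (bus_routes.groupby(["bus_name", "date"])["route_id"]
                  .nunique()
                  .rename("n_routes")
                  .reset_index())

# First and last date each bus appears, plus the number of days it was seen.
# A late first_seen or early last_seen hints at a bus entering or leaving service.
service_span = (bus_routes.groupby("bus_name")["date"]
                .agg(first_seen="min", last_seen="max", days_seen="nunique")
                .reset_index())

print(routes_per_day)
print(service_span)
```

The same pattern works for stops by swapping `route_id` for `stop_id`.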

Do post your findings here as well!



Here are some of the observations and questions I have about the ITMS dataset.

  • Why is there very little data for weeks 45 to 48?

  • Of the 21,805 trips, field data is available for less than 10%.

  • And of those roughly 2k trips, less than 50% have sufficient data to be modeled.
    Did anyone else reach a similar conclusion? :thinking:

  • @admin, for the evaluation, are stop_IDs from all the trips considered, or only those from the ‘good trips’?

Hi Raghavendra,
Thanks for your observations.
We will ensure that only testing points from good trips are used.
Additionally, we encourage you all to add the trips you feel are good here

We will ensure that only trips from this list will be tested against.

Sure, I’m working on standardising the method to assess the quality of a trip. Will post the trips along with a small write-up once I get it running.

Do you mean that the data we have in stop_times.txt is less than the actual data from the Google Drive?

Hi @meer1992, No, it’s the other way around.
stop_times.txt has all the possible stops and bus schedules that are planned.
The sensor data (from the Google Drive) covers only a few routes, not all. Among those routes, some trips have sufficient data, while others have too little to be modeled.
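One possible way to quantify this gap is to check, per scheduled trip, what share of its planned stops shows up in the sensor data. The sketch below is not the method anyone in this thread used: the sensor column names are assumed, the rows are invented, and the `MIN_SHARE` cut-off for a "good" trip is an arbitrary placeholder.

```python
import pandas as pd

# Toy stand-in for GTFS stop_times.txt (planned stops per trip); values are invented.
stop_times = pd.DataFrame({
    "trip_id": ["T1", "T1", "T1", "T1", "T2", "T2", "T3", "T3"],
    "stop_id": ["S1", "S2", "S3", "S4", "S1", "S2", "S5", "S6"],
})

# Toy stand-in for the sensor data; assumes it carries trip_id and stop_id columns.
sensor = pd.DataFrame({
    "trip_id": ["T1", "T1", "T1", "T1", "T2"],
    "stop_id": ["S1", "S1", "S2", "S3", "S1"],
})

scheduled = stop_times[["trip_id", "stop_id"]].drop_duplicates()
observed = sensor[["trip_id", "stop_id"]].drop_duplicates()

# Left merge keeps every planned stop; the indicator column marks which ones were observed.
merged = scheduled.merge(observed, on=["trip_id", "stop_id"],
                         how="left", indicator=True)
merged["seen"] = merged["_merge"] == "both"

coverage = merged.groupby("trip_id")["seen"].agg(n_scheduled="size", n_seen="sum")
coverage["share_seen"] = coverage["n_seen"] / coverage["n_scheduled"]

MIN_SHARE = 0.7  # arbitrary, hypothetical threshold for calling a trip "good"
good_trips = coverage.index[coverage["share_seen"] >= MIN_SHARE].tolist()

# Share of scheduled trips that have any sensor data at all.
share_with_data = (coverage["n_seen"] > 0).mean()

print(coverage)
print(good_trips, share_with_data)
```

Whatever threshold is chosen, reporting it alongside the list of good trips would make the lists from different people comparable.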


Hi, posting another set of derived data: I processed the historical data shared on Google Drive, sampled it in 10-minute windows, and got counts of unique values of bus name, route_id, trip_id, and stop_id in every 10-minute period.
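A rough sketch of how such 10-minute counts could be produced with pandas `resample` is below. The column names and rows are assumptions for illustration, not the actual schema of the shared files.

```python
import pandas as pd

# Toy rows standing in for the historical feed; all values are invented.
raw = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2021-11-01 08:01", "2021-11-01 08:04", "2021-11-01 08:09",
        "2021-11-01 08:12", "2021-11-01 08:35",
    ]),
    "bus_name": ["B1", "B2", "B1", "B1", "B3"],
    "route_id": ["R10", "R20", "R10", "R10", "R30"],
    "trip_id": ["T1", "T2", "T1", "T1", "T3"],
    "stop_id": ["S1", "S7", "S2", "S3", "S9"],
})

# Bucket rows into 10-minute windows and count distinct values per column.
# Windows with no rows still appear, with a count of 0.
counts = (raw.set_index("timestamp")
          .resample("10min")[["bus_name", "route_id", "trip_id", "stop_id"]]
          .nunique())

print(counts)
```

Because empty windows are kept with zero counts, gaps in the feed show up directly in the resulting time series.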

Visualized as a zoomable time-series chart: