Reputation: 7526
I am working with monthly Citi bike trip data that looks like this:
> head(data) tripduration starttime stoptime start.station.id start.station.name end.station.id 1 732 2015-07-01 00:00:03 2015-07-01 00:12:16 489 10 Ave & W 28 St 368 2 322 2015-07-01 00:00:06 2015-07-01 00:05:29 304 Broadway & Battery Pl 3002 3 790 2015-07-01 00:00:17 2015-07-01 00:13:28 447 8 Ave & W 52 St 358 4 1228 2015-07-01 00:00:23 2015-07-01 00:20:51 490 8 Ave & W 33 St 250 5 1383 2015-07-01 00:00:44 2015-07-01 00:23:48 327 Vesey Pl & River Terrace 72 6 603 2015-07-01 00:01:00 2015-07-01 00:11:04 455 1 Ave & E 44 St 367 end.station.name bikeid usertype birth.year gender 1 Carmine St & 6 Ave 18669 Subscriber 1970 1 2 South End Ave & Liberty St 14618 Subscriber 1984 1 3 Christopher St & Greenwich St 18801 Subscriber 1992 1 4 Lafayette St & Jersey St 19137 Subscriber 1990 1 5 W 52 St & 11 Ave 15808 Subscriber 1988 1 6 E 53 St & Lexington Ave 17069 Subscriber 1953 1
Each trip has it's own unique record, and only for the month of July there are some 1,085,676 trips. The raw data in .csv format can be found here on the Citi Bike website. Normally, bikes start from the station where they ended on the previous trip. However, this is not always the case. Sometimes bikes will start at different station than where they ended, which indicates the bikes were "rebalanced" or moved by a truck from one station to the next to meet station demands. I want to filter out all of the "normal" trips and isolate all of the instances where the bikes started at a different station than where they ended (e.g. start.station.id
is not equal to previous end.station.id
.) The unique identifying factor of bikes is bikeid
, and this must be used. Here is a subset of the month's data by one bikeid (the most frequently ridden bike):
head(onebike) tripduration starttime stoptime start.station.id start.station.name 1952 691 2015-07-01 07:23:24 2015-07-01 07:34:56 161 LaGuardia Pl & W 3 St 2369 332 2015-07-01 07:38:49 2015-07-01 07:44:22 379 W 31 St & 7 Ave 3879 259 2015-07-01 08:14:34 2015-07-01 08:18:54 472 E 32 St & Park Ave 4310 1112 2015-07-01 08:22:53 2015-07-01 08:41:25 498 Broadway & W 32 St 5795 1509 2015-07-01 08:47:18 2015-07-01 09:12:27 345 W 13 St & 6 Ave 7857 1361 2015-07-01 09:23:50 2015-07-01 09:46:32 348 W Broadway & Spring St end.station.id end.station.name bikeid usertype birth.year gender 1952 379 W 31 St & 7 Ave 22075 Subscriber 1985 1 2369 472 E 32 St & Park Ave 22075 Subscriber 1986 1 3879 498 Broadway & W 32 St 22075 Subscriber 1986 1 4310 345 W 13 St & 6 Ave 22075 Customer NA 0 5795 348 W Broadway & Spring St 22075 Customer NA 0 7857 386 Centre St & Worth St 22075 Customer NA 0
The task is now to select instances where the start.station.id
of a row does not equal end.station.id
of the previous row.
The result should contain the bikeid
, the end.station.id
, the start.station.id
, and the time difference between when the bike was dropped off and picked up (indicating approximately when it was moved).
Is the best way to do this to use the shift()
function?
How would one loop through every bike.id in the first dataset (there are about 7000) to reveal all of the hidden movements?
Upvotes: 0
Views: 374
Reputation: 766
I am sure there are hundreds of ways to do this, but since the data is only around 100MB, a simple for loop is very capable and very flexible to modify and extend in this case, so here it is (done in seconds):
raw_data = read.csv("201511-citibike-tripdata.csv")
bikeid <-22075
onebike <- raw_data[ which(raw_data$bikeid== bikeid), ]
output <- data.frame("bikeid"= integer(0), "end.station.id"= integer(0), "start.station.id" = integer(0), "diff.time" = numeric(0))
for(i in 2:nrow(onebike)) {
if(onebike[i-1,"end.station.id"] != onebike[i,"start.station.id"]){
diff_time <- as.double(difftime(strptime(onebike[i-1,"stoptime"], "%m/%d/%Y %H:%M:%S"),
strptime(onebike[i,"starttime"], "%m/%d/%Y %H:%M:%S"),units = "mins"))
new_row <- c(bikeid, onebike[i-1,"end.station.id"], onebike[i,"start.station.id"], diff_time)
output[nrow(output) + 1,] = new_row
}
}
output
bikeid end.station.id start.station.id diff.time
1 22075 514 520 181.5667
2 22075 356 502 628.8833
Edit: This is to further answer the question in the comments. It is an easy extension to include all the bikeids:
raw_data = read.csv("201511-citibike-tripdata.csv")
unique_id = unique(raw_data$bikeid)
#bikeid <-22075
output <- data.frame("bikeid"= integer(0), "end.station.id"= integer(0), "start.station.id" = integer(0), "diff.time" = numeric(0), "stoptime" = character(),"starttime" = character(), stringsAsFactors=FALSE)
for (bikeid in unique_id)
{
onebike <- raw_data[ which(raw_data$bikeid== bikeid), ]
if(nrow(onebike) >=2 ){
for(i in 2:nrow(onebike )) {
if(is.integer(onebike[i-1,"end.station.id"]) & is.integer(onebike[i,"start.station.id"]) &
onebike[i-1,"end.station.id"] != onebike[i,"start.station.id"]){
diff_time <- as.double(difftime(strptime(onebike[i,"starttime"], "%m/%d/%Y %H:%M:%S"),
strptime(onebike[i-1,"stoptime"], "%m/%d/%Y %H:%M:%S")
,units = "mins"))
new_row <- c(bikeid, onebike[i-1,"end.station.id"], onebike[i,"start.station.id"], diff_time, as.character(onebike[i-1,"stoptime"]), as.character(onebike[i,"starttime"]))
output[nrow(output) + 1,] = new_row
}
}
}
}
dim(output)
[1] 32589 6
head(output)
bikeid end.station.id start.station.id diff.time stoptime starttime
1 22545 520 529 24.8166666666667 11/2/2015 08:38:22 11/2/2015 09:03:11
2 22545 520 517 537.483333333333 11/2/2015 09:39:19 11/2/2015 18:36:48
3 22545 2004 3230 563.066666666667 11/2/2015 22:06:27 11/3/2015 07:29:31
4 22545 296 3236 471.783333333333 11/4/2015 23:40:29 11/5/2015 07:32:16
5 22545 520 449 43.4166666666667 11/9/2015 08:24:06 11/9/2015 09:07:31
6 22545 359 519 30.7166666666667 11/9/2015 09:14:46 11/9/2015 09:45:29
Upvotes: 1