iskandarblue
iskandarblue

Reputation: 7526

Comparing data in previous row using shift() function

I am working with monthly Citi bike trip data that looks like this:

 
> head(data)
  tripduration           starttime            stoptime start.station.id       start.station.name end.station.id 
1          732 2015-07-01 00:00:03 2015-07-01 00:12:16              489         10 Ave & W 28 St            368
2          322 2015-07-01 00:00:06 2015-07-01 00:05:29              304    Broadway & Battery Pl           3002
3          790 2015-07-01 00:00:17 2015-07-01 00:13:28              447          8 Ave & W 52 St            358
4         1228 2015-07-01 00:00:23 2015-07-01 00:20:51              490          8 Ave & W 33 St            250
5         1383 2015-07-01 00:00:44 2015-07-01 00:23:48              327 Vesey Pl & River Terrace             72
6          603 2015-07-01 00:01:00 2015-07-01 00:11:04              455          1 Ave & E 44 St            367
               end.station.name bikeid   usertype birth.year gender
1            Carmine St & 6 Ave  18669 Subscriber       1970      1
2    South End Ave & Liberty St  14618 Subscriber       1984      1
3 Christopher St & Greenwich St  18801 Subscriber       1992      1
4      Lafayette St & Jersey St  19137 Subscriber       1990      1
5              W 52 St & 11 Ave  15808 Subscriber       1988      1
6       E 53 St & Lexington Ave  17069 Subscriber       1953      1    

Each trip has it's own unique record, and only for the month of July there are some 1,085,676 trips. The raw data in .csv format can be found here on the Citi Bike website. Normally, bikes start from the station where they ended on the previous trip. However, this is not always the case. Sometimes bikes will start at different station than where they ended, which indicates the bikes were "rebalanced" or moved by a truck from one station to the next to meet station demands. I want to filter out all of the "normal" trips and isolate all of the instances where the bikes started at a different station than where they ended (e.g. start.station.id is not equal to previous end.station.id.) The unique identifying factor of bikes is bikeid, and this must be used. Here is a subset of the month's data by one bikeid (the most frequently ridden bike):

head(onebike)
     tripduration           starttime            stoptime start.station.id     start.station.name
1952          691 2015-07-01 07:23:24 2015-07-01 07:34:56              161  LaGuardia Pl & W 3 St
2369          332 2015-07-01 07:38:49 2015-07-01 07:44:22              379        W 31 St & 7 Ave
3879          259 2015-07-01 08:14:34 2015-07-01 08:18:54              472     E 32 St & Park Ave
4310         1112 2015-07-01 08:22:53 2015-07-01 08:41:25              498     Broadway & W 32 St
5795         1509 2015-07-01 08:47:18 2015-07-01 09:12:27              345        W 13 St & 6 Ave
7857         1361 2015-07-01 09:23:50 2015-07-01 09:46:32              348 W Broadway & Spring St
     end.station.id       end.station.name bikeid   usertype birth.year gender
1952            379        W 31 St & 7 Ave  22075 Subscriber       1985      1
2369            472     E 32 St & Park Ave  22075 Subscriber       1986      1
3879            498     Broadway & W 32 St  22075 Subscriber       1986      1
4310            345        W 13 St & 6 Ave  22075   Customer         NA      0
5795            348 W Broadway & Spring St  22075   Customer         NA      0
7857            386   Centre St & Worth St  22075   Customer         NA      0

The task is now to select instances where the start.station.id of a row does not equal end.station.id of the previous row.

The result should contain the bikeid, the end.station.id, the start.station.id, and the time difference between when the bike was dropped off and picked up (indicating approximately when it was moved).

Is the best way to do this to use the shift() function?

How would one loop through every bike.id in the first dataset (there are about 7000) to reveal all of the hidden movements?

Upvotes: 0

Views: 374

Answers (1)

TongZZZ
TongZZZ

Reputation: 766

I am sure there are hundreds of ways to do this, but since the data is only around 100MB, a simple for loop is very capable and very flexible to modify and extend in this case, so here it is (done in seconds):

raw_data = read.csv("201511-citibike-tripdata.csv")
bikeid <-22075
onebike <- raw_data[ which(raw_data$bikeid== bikeid), ]
output <- data.frame("bikeid"= integer(0), "end.station.id"= integer(0), "start.station.id" = integer(0), "diff.time" = numeric(0))

for(i in 2:nrow(onebike)) {
  if(onebike[i-1,"end.station.id"] != onebike[i,"start.station.id"]){
    diff_time <- as.double(difftime(strptime(onebike[i-1,"stoptime"], "%m/%d/%Y %H:%M:%S"),
                                    strptime(onebike[i,"starttime"], "%m/%d/%Y %H:%M:%S"),units = "mins"))
    new_row <- c(bikeid, onebike[i-1,"end.station.id"], onebike[i,"start.station.id"], diff_time)
    output[nrow(output) + 1,] = new_row
  }
}
output

bikeid end.station.id start.station.id diff.time
1  22075            514              520  181.5667
2  22075            356              502  628.8833

Edit: This is to further answer the question in the comments. It is an easy extension to include all the bikeids:

raw_data = read.csv("201511-citibike-tripdata.csv")
unique_id = unique(raw_data$bikeid)
#bikeid <-22075

output <- data.frame("bikeid"= integer(0), "end.station.id"= integer(0), "start.station.id" = integer(0), "diff.time" = numeric(0),  "stoptime" = character(),"starttime" = character(), stringsAsFactors=FALSE)

for (bikeid in unique_id)
{
onebike <- raw_data[ which(raw_data$bikeid== bikeid), ]

if(nrow(onebike) >=2 ){
for(i in 2:nrow(onebike )) {
  if(is.integer(onebike[i-1,"end.station.id"]) & is.integer(onebike[i,"start.station.id"]) &
     onebike[i-1,"end.station.id"] != onebike[i,"start.station.id"]){
    diff_time <- as.double(difftime(strptime(onebike[i,"starttime"], "%m/%d/%Y %H:%M:%S"),
                                    strptime(onebike[i-1,"stoptime"], "%m/%d/%Y %H:%M:%S")
                                    ,units = "mins"))
    new_row <- c(bikeid, onebike[i-1,"end.station.id"], onebike[i,"start.station.id"], diff_time, as.character(onebike[i-1,"stoptime"]), as.character(onebike[i,"starttime"]))
    output[nrow(output) + 1,] = new_row
  }
}
}
}
dim(output)
[1] 32589     6
head(output)
  bikeid end.station.id start.station.id        diff.time           stoptime          starttime
1  22545            520              529 24.8166666666667 11/2/2015 08:38:22 11/2/2015 09:03:11
2  22545            520              517 537.483333333333 11/2/2015 09:39:19 11/2/2015 18:36:48
3  22545           2004             3230 563.066666666667 11/2/2015 22:06:27 11/3/2015 07:29:31
4  22545            296             3236 471.783333333333 11/4/2015 23:40:29 11/5/2015 07:32:16
5  22545            520              449 43.4166666666667 11/9/2015 08:24:06 11/9/2015 09:07:31
6  22545            359              519 30.7166666666667 11/9/2015 09:14:46 11/9/2015 09:45:29

Upvotes: 1

Related Questions