ike
ike

Reputation: 312

Counting Instances of Multiple Variables in R

I have a large data table Divvy (over 2.4 million records) that appears as such (some columns removed):

X   trip_id     from_station_id.x   to_station_id.x 
 1  1109420     94                  69
 2  1109421     69                  216
 3  1109427     240                 245
 4  1109431     113                 94
 5  1109433     127                 332
 3  1109429     240                 245

I would like to find the number of trips from each station to each opposing station. So for example,

From X     To Y     Sum
94         69       1
240        245      2

etc. and then join it back to the inital table using dplyr to make something like the below and then limit it to distinct from_station_id/to_combos, which I'll use to map routes (I have lat/long for each station):

X   trip_id     from_station_id.x   to_station_id.x   Sum 
 1  1109420     94                  69                1
 2  1109421     69                  216               1
 3  1109427     240                 245               2
 4  1109431     113                 94                1
 5  1109433     127                 332               1
 3  1109429     240                 245               1

I successfully used count to get some of this, such as:

count(Divvy$from_station_id.x==94 & Divvy$to_station_id.x == 69)
  x    freq
1 FALSE 2454553
2  TRUE      81

But this is obviously labor intensive as there are 300 unique stations, so well over 44k poss combinations. I created a helper table thinking I could loop it.

n <- select(Divvy, from_station_id.y )

  from_station_id.x 
1                94                
2                69                
3               240               
4               113               
5               113               
6               127               

   count(Divvy$from_station_id.x==n[1,1] & Divvy$to_station_id.x == n[2,1])

      x    freq
1 FALSE 2454553
2  TRUE      81

I felt like a loop such as

output <- matrix(ncol=variables, nrow=iterations)


output <- matrix()
for(i in 1:n)(output[i, count(Divvy$from_station_id.x==n[1,1] & Divvy$to_station_id.x == n[2,1]))

should work but come to think of it that will still only return 300 rows, not 44k, so it would have to then loop back and do n[2] & n[1] etc...

I felt like there might also be a quicker dplyr solution that would let me return a count of each combo and append it directly without the extra steps/table creation, but I haven't found it.

I'm newer to R and I have searched around/think I'm close, but I can't quite connect that last dot of joining that result to Divvy. Any help appreciated.

Upvotes: 4

Views: 5129

Answers (3)

jazzurro
jazzurro

Reputation: 23574

Since you said "limit it to distinct from_station_id/to_combos", the following code seems to provide what you are after. Your data is called mydf.

library(dplyr)
group_by(mydf, from_station_id.x, to_station_id.x) %>%
count(from_station_id.x, to_station_id.x)

#  from_station_id.x to_station_id.x n
#1                69             216 1
#2                94              69 1
#3               113              94 1
#4               127             332 1
#5               240             245 2

Upvotes: 4

Dominic Comtois
Dominic Comtois

Reputation: 10421

I'm not entirely sure that's what you're looking for as a result, but this calculates the number of trips having the same origin and destination. Feel free to comment and let me know if that's not quite what you expect as a final result.

dat <- read.table(text="X   trip_id     from_station_id.x   to_station_id.x 
 1  1109420     94                  69
 2  1109421     69                  216
 3  1109427     240                 245
 4  1109431     113                 94
 5  1109433     127                 332
 3  1109429     240                 245", header=TRUE)

dat$from.to <- paste(dat$from_station_id.x, dat$to_station_id.x, sep="-")
freqs <- as.data.frame(table(dat$from.to))
names(freqs) <- c("from.to", "sum")
dat2 <- merge(dat, freqs, by="from.to")
dat2 <- dat2[order(dat2$trip_id),-1]

Results

dat2

#   X trip_id from_station_id.x to_station_id.x sum
# 6 1 1109420                94              69   1
# 5 2 1109421                69             216   1
# 3 3 1109427               240             245   2
# 4 3 1109429               240             245   2
# 1 4 1109431               113              94   1
# 2 5 1109433               127             332   1

Upvotes: 3

Metrics
Metrics

Reputation: 15458

#Here is the data.table solution, which is useful if you are working with large data: 
library(data.table)
setDT(DF)[,sum:=.N,by=.(from_station_id.x,to_station_id.x)][] #DF is your dataframe

   X trip_id from_station_id.x to_station_id.x sum
1: 1 1109420                94              69   1
2: 2 1109421                69             216   1
3: 3 1109427               240             245   2
4: 4 1109431               113              94   1
5: 5 1109433               127             332   1
6: 3 1109429               240             245   2

Upvotes: 5

Related Questions