Martijn
Martijn

Reputation: 129

R- COUNTIFS function

For each timestamp I would like to count the number of trains/buses with a coordinate higher than the reference. See here below the dataset:

structure(list(timestamp = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), type = structure(c(3L, 
3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 2L, 3L, 3L, 3L, 3L, 3L, 1L, 
1L, 1L, 1L, 1L, 2L), .Label = c("Bus", "Reference", "Train"), class = "factor"), 
    X.coordinate = c(470L, -300L, 25L, 456L, 37L, 19L, 798L, 
    -56L, 489L, 412L, 350L, 278L, 970L, -65L, -894L, 780L, 265L, 
    -25L, 365L, 785L, 95L, 85L)), .Names = c("timestamp", "type", 
"X.coordinate"), row.names = c(NA, -22L), class = "data.frame")

I would like to add 2 columns. One column that gives for each row with "reference" the number of trains with a coordinate higher than the reference. The second column should give the number of buses with a coordinate higher than the reference.

See here below the desired output:

 timestamp      type X.coordinate Train_with_higher_X Bus_with_higher_X
1          1     Train          470                  NA                NA
2          1     Train         -300                  NA                NA
3          1     Train           25                  NA                NA
4          1     Train          456                  NA                NA
5          1     Train           37                  NA                NA
6          1       Bus           19                  NA                NA
7          1       Bus          798                  NA                NA
8          1       Bus          -56                  NA                NA
9          1       Bus          489                  NA                NA
10         1       Bus          412                  NA                NA
11         1 Reference          350                   2                 3
12         2     Train          278                  NA                NA
13         2     Train          970                  NA                NA
14         2     Train          -65                  NA                NA
15         2     Train         -894                  NA                NA
16         2     Train          780                  NA                NA
17         2       Bus          265                  NA                NA
18         2       Bus          -25                  NA                NA
19         2       Bus          365                  NA                NA
20         2       Bus          785                  NA                NA
21         2       Bus           95                  NA                NA
22         2 Reference           85                   3                 4

I tried to write some code in R to get this in an elegant way, but my programming skills are too limited for the moment. Is there anybody that knows a solution for this problem? Thank you on beforehand!

Upvotes: 1

Views: 1584

Answers (3)

Empiromancer
Empiromancer

Reputation: 3854

Getting familiar with packages like data.table and dplyr, like other answers suggest, is definitely a valuable investment of your time. But if you want to solve this problem using more familiar tools, that's also possible. You can do this without any extra packages by combining some base R functions: mapply, ifelse, and with, into the following lines of code (where I've named your data frame d)

bus_sum <- function(x, y) with(d[d$timestamp == x & d$type == "Bus",], sum(X.coordinate > y))
d$Bus_with_higher_X <- ifelse(d$type == "Reference", mapply(FUN = bus_sum, d$timestamp, d$X.coordinate), NA)

train_sum <- function(x, y) with(d[d$timestamp == x & d$type == "Train",], sum(X.coordinate > y))
d$Train_with_higher_X <- ifelse(d$type == "Reference", mapply(FUN = bus_sum, d$timestamp, d$X.coordinate), NA)

The result is pretty much exactly what you wanted, I believe.

   timestamp      type X.coordinate Train_with_higher_X Bus_with_higher_X
1          1     Train          470                  NA                NA
2          1     Train         -300                  NA                NA
3          1     Train           25                  NA                NA
4          1     Train          456                  NA                NA
5          1     Train           37                  NA                NA
6          1       Bus           19                  NA                NA
7          1       Bus          798                  NA                NA
8          1       Bus          -56                  NA                NA
9          1       Bus          489                  NA                NA
10         1       Bus          412                  NA                NA
11         1 Reference          350                   2                 3
12         2     Train          278                  NA                NA
13         2     Train          970                  NA                NA
14         2     Train          -65                  NA                NA
15         2     Train         -894                  NA                NA
16         2     Train          780                  NA                NA
17         2       Bus          265                  NA                NA
18         2       Bus          -25                  NA                NA
19         2       Bus          365                  NA                NA
20         2       Bus          785                  NA                NA
21         2       Bus           95                  NA                NA
22         2 Reference           85                   3                 4

The functions bus_sum and train_sum take a timestamp and a reference coordinate, and count the number of rows of each type that have an X.coordinate greater than the reference coordinate (using with restricts consideration to only the part of the data frame that matches that timestamp and type). These two functions differ only in d$type == "Bus" vs d$type == "Train", so it would be quite easy to generalize this with another argument to the function. Putting each function inside mapply calls it for the timestamp and type of each row. Then wrapping that mapply inside an ifelse just manually sets all non-reference rows to NA.

Upvotes: 1

eipi10
eipi10

Reputation: 93881

As an alternative to mixing summary data and raw data in a single data frame (and hence having lots of missing values in the summary columns), how about these options:

library(dplyr)

Count number of buses and trains greater than reference, by timestamp:

dat %>% group_by(timestamp) %>%
  mutate(Reference = X.coordinate[type=="Reference"]) %>%
  filter(type != "Reference") %>%
  group_by(timestamp, type) %>%
  summarise(Reference = unique(Reference), 
            numGTref = sum(X.coordinate > Reference))

  timestamp   type Reference numGTref
1         1    Bus       350        3
2         1  Train       350        2
3         2    Bus        85        4
4         2  Train        85        3

Mark buses and trains that are greater than reference, by timestamp:

dat %>% group_by(timestamp) %>%
  mutate(Reference = X.coordinate[type=="Reference"]) %>%
  filter(type != "Reference") %>%
  group_by(timestamp, type) %>%
  mutate(Status = ifelse(X.coordinate > Reference, 
                         paste("Greater than", Reference), 
                         paste("Less than", Reference))) %>%
  select(-Reference)

   timestamp   type X.coordinate           Status
1          1  Train          470 Greater than 350
2          1  Train         -300    Less than 350
3          1  Train           25    Less than 350
4          1  Train          456 Greater than 350
5          1  Train           37    Less than 350
6          1    Bus           19    Less than 350
7          1    Bus          798 Greater than 350
8          1    Bus          -56    Less than 350
9          1    Bus          489 Greater than 350
10         1    Bus          412 Greater than 350
11         2  Train          278  Greater than 85
12         2  Train          970  Greater than 85
13         2  Train          -65     Less than 85
14         2  Train         -894     Less than 85
15         2  Train          780  Greater than 85
16         2    Bus          265  Greater than 85
17         2    Bus          -25     Less than 85
18         2    Bus          365  Greater than 85
19         2    Bus          785  Greater than 85
20         2    Bus           95  Greater than 85

Upvotes: 1

MichaelChirico
MichaelChirico

Reputation: 34763

This gets the point across (using data.table):

library(data.table)
setDT(DF)

DF[ , Train_with_higher_X :=
      sum(X.coordinate[type == "Train"] > 
            X.coordinate[type == "Reference"]), by = timestamp]

And similarly for Bus_with_higher_X.

It's a bit more complicated to leave the value as NA for rows where type is not "Reference", but if you're committed to that, fix it up ex-post:

DF[type != "Reference", Train_with_higher_X := NA]

This approach also relies on there only being one row with type "Reference" within each value of timestamp.

Upvotes: 1

Related Questions