Reputation: 129
For each timestamp I would like to count the number of trains/buses with a coordinate higher than the reference. See here below the dataset:
structure(list(timestamp = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), type = structure(c(3L,
3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 2L, 3L, 3L, 3L, 3L, 3L, 1L,
1L, 1L, 1L, 1L, 2L), .Label = c("Bus", "Reference", "Train"), class = "factor"),
X.coordinate = c(470L, -300L, 25L, 456L, 37L, 19L, 798L,
-56L, 489L, 412L, 350L, 278L, 970L, -65L, -894L, 780L, 265L,
-25L, 365L, 785L, 95L, 85L)), .Names = c("timestamp", "type",
"X.coordinate"), row.names = c(NA, -22L), class = "data.frame")
I would like to add 2 columns. One column that gives for each row with "reference" the number of trains with a coordinate higher than the reference. The second column should give the number of buses with a coordinate higher than the reference.
See here below the desired output:
timestamp type X.coordinate Train_with_higher_X Bus_with_higher_X
1 1 Train 470 NA NA
2 1 Train -300 NA NA
3 1 Train 25 NA NA
4 1 Train 456 NA NA
5 1 Train 37 NA NA
6 1 Bus 19 NA NA
7 1 Bus 798 NA NA
8 1 Bus -56 NA NA
9 1 Bus 489 NA NA
10 1 Bus 412 NA NA
11 1 Reference 350 2 3
12 2 Train 278 NA NA
13 2 Train 970 NA NA
14 2 Train -65 NA NA
15 2 Train -894 NA NA
16 2 Train 780 NA NA
17 2 Bus 265 NA NA
18 2 Bus -25 NA NA
19 2 Bus 365 NA NA
20 2 Bus 785 NA NA
21 2 Bus 95 NA NA
22 2 Reference 85 3 4
I tried to write some code in R to get this in an elegant way, but my programming skills are too limited for the moment. Is there anybody that knows a solution for this problem? Thank you on beforehand!
Upvotes: 1
Views: 1584
Reputation: 3854
Getting familiar with packages like data.table
and dplyr
, like other answers suggest, is definitely a valuable investment of your time. But if you want to solve this problem using more familiar tools, that's also possible. You can do this without any extra packages by combining some base R functions: mapply
, ifelse
, and with
, into the following lines of code (where I've named your data frame d
)
bus_sum <- function(x, y) with(d[d$timestamp == x & d$type == "Bus",], sum(X.coordinate > y))
d$Bus_with_higher_X <- ifelse(d$type == "Reference", mapply(FUN = bus_sum, d$timestamp, d$X.coordinate), NA)
train_sum <- function(x, y) with(d[d$timestamp == x & d$type == "Train",], sum(X.coordinate > y))
d$Train_with_higher_X <- ifelse(d$type == "Reference", mapply(FUN = bus_sum, d$timestamp, d$X.coordinate), NA)
The result is pretty much exactly what you wanted, I believe.
timestamp type X.coordinate Train_with_higher_X Bus_with_higher_X
1 1 Train 470 NA NA
2 1 Train -300 NA NA
3 1 Train 25 NA NA
4 1 Train 456 NA NA
5 1 Train 37 NA NA
6 1 Bus 19 NA NA
7 1 Bus 798 NA NA
8 1 Bus -56 NA NA
9 1 Bus 489 NA NA
10 1 Bus 412 NA NA
11 1 Reference 350 2 3
12 2 Train 278 NA NA
13 2 Train 970 NA NA
14 2 Train -65 NA NA
15 2 Train -894 NA NA
16 2 Train 780 NA NA
17 2 Bus 265 NA NA
18 2 Bus -25 NA NA
19 2 Bus 365 NA NA
20 2 Bus 785 NA NA
21 2 Bus 95 NA NA
22 2 Reference 85 3 4
The functions bus_sum
and train_sum
take a timestamp and a reference coordinate, and count the number of rows of each type that have an X.coordinate
greater than the reference coordinate (using with
restricts consideration to only the part of the data frame that matches that timestamp and type). These two functions differ only in d$type == "Bus"
vs d$type == "Train"
, so it would be quite easy to generalize this with another argument to the function. Putting each function inside mapply
calls it for the timestamp and type of each row. Then wrapping that mapply
inside an ifelse
just manually sets all non-reference rows to NA
.
Upvotes: 1
Reputation: 93881
As an alternative to mixing summary data and raw data in a single data frame (and hence having lots of missing values in the summary columns), how about these options:
library(dplyr)
Count number of buses and trains greater than reference, by timestamp:
dat %>% group_by(timestamp) %>%
mutate(Reference = X.coordinate[type=="Reference"]) %>%
filter(type != "Reference") %>%
group_by(timestamp, type) %>%
summarise(Reference = unique(Reference),
numGTref = sum(X.coordinate > Reference))
timestamp type Reference numGTref
1 1 Bus 350 3
2 1 Train 350 2
3 2 Bus 85 4
4 2 Train 85 3
Mark buses and trains that are greater than reference, by timestamp:
dat %>% group_by(timestamp) %>%
mutate(Reference = X.coordinate[type=="Reference"]) %>%
filter(type != "Reference") %>%
group_by(timestamp, type) %>%
mutate(Status = ifelse(X.coordinate > Reference,
paste("Greater than", Reference),
paste("Less than", Reference))) %>%
select(-Reference)
timestamp type X.coordinate Status
1 1 Train 470 Greater than 350
2 1 Train -300 Less than 350
3 1 Train 25 Less than 350
4 1 Train 456 Greater than 350
5 1 Train 37 Less than 350
6 1 Bus 19 Less than 350
7 1 Bus 798 Greater than 350
8 1 Bus -56 Less than 350
9 1 Bus 489 Greater than 350
10 1 Bus 412 Greater than 350
11 2 Train 278 Greater than 85
12 2 Train 970 Greater than 85
13 2 Train -65 Less than 85
14 2 Train -894 Less than 85
15 2 Train 780 Greater than 85
16 2 Bus 265 Greater than 85
17 2 Bus -25 Less than 85
18 2 Bus 365 Greater than 85
19 2 Bus 785 Greater than 85
20 2 Bus 95 Greater than 85
Upvotes: 1
Reputation: 34763
This gets the point across (using data.table
):
library(data.table)
setDT(DF)
DF[ , Train_with_higher_X :=
sum(X.coordinate[type == "Train"] >
X.coordinate[type == "Reference"]), by = timestamp]
And similarly for Bus_with_higher_X
.
It's a bit more complicated to leave the value as NA
for rows where type
is not "Reference"
, but if you're committed to that, fix it up ex-post:
DF[type != "Reference", Train_with_higher_X := NA]
This approach also relies on there only being one row with type
"Reference"
within each value of timestamp
.
Upvotes: 1