Subsetting data frame based on max number of instances of another column

Question

I have some data in the following format, where multiple sensors have data with timestamps. The is some overlap on timestamps between each sensor, however there are gaps in some of the sensors where no data is present at all. I am looking to subset the data in a way where I only pull timestamps/rows where there is a record for each sensor. Here is a small example:

library(lubridate)
set.seed(88)

start_date = as.POSIXct("2018-08-13 17:30:00 GMT")
datetime = seq(start_date, start_date + minutes(20), by = "5 min")

df <- data.frame(Datetime = datetime, X = runif(15), Sensor = 
c(rep(1,5),rep(2,3),rep(3,4),rep(4,3)))
df$Datetime[12] <- df$Datetime[12] + minutes(5)
df

              Datetime          X Sensor
1  2018-08-13 17:30:00 0.71008793      1
2  2018-08-13 17:35:00 0.23754185      1
3  2018-08-13 17:40:00 0.43144271      1
4  2018-08-13 17:45:00 0.06922129      1
5  2018-08-13 17:50:00 0.78754994      1
6  2018-08-13 17:30:00 0.74219767      2
7  2018-08-13 17:35:00 0.49680693      2
8  2018-08-13 17:40:00 0.28491360      2
9  2018-08-13 17:45:00 0.62937634      3
10 2018-08-13 17:50:00 0.28642922      3
11 2018-08-13 17:30:00 0.44928292      3
12 2018-08-13 17:40:00 0.97428628      3
13 2018-08-13 17:40:00 0.10112843      4
14 2018-08-13 17:45:00 0.89788230      4
15 2018-08-13 17:50:00 0.53922885      4

Desired output here would be:

                  Datetime         X Sensor
3  2018-08-13 17:40:00 0.4314427      1
8  2018-08-13 17:40:00 0.2849136      2
12 2018-08-13 17:40:00 0.9742863      3
13 2018-08-13 17:40:00 0.1011284      4

M-- · Accepted Answer

We can do it in dplyr by keeping only dates that have entries equal to number of sensors;

df %>% group_by(Datetime) %>%  filter(n()==n_distinct(.$Sensor))

#> # A tibble: 4 x 3
#> # Groups:   Datetime [1]
#>   Datetime                X Sensor
#>                   
#> 1 2018-08-13 17:40:00 0.741      1
#> 2 2018-08-13 17:40:00 0.760      2
#> 3 2018-08-13 17:40:00 0.744      3
#> 4 2018-08-13 17:40:00 0.553      4

Another method in base would be converting the data from long to wide and keeping only completed cases (the ones with data for all Sensors) and then convert the results back to long format;

wide <- reshape(df, idvar = "Datetime", timevar = "Sensor", direction = "wide")

reshape(na.omit(wide), direction = "long", varying = list(names(wide)[-1]), v.names = "X", 
        idvar = "Datetime", timevar = "Sensor", times = unique(df$Sensor)); rm(wide)
#>                                  Datetime Sensor          X
#> 2018-08-13 17:40:00.1 2018-08-13 17:40:00      1  0.7410448
#> 2018-08-13 17:40:00.2 2018-08-13 17:40:00      2  0.7602078
#> 2018-08-13 17:40:00.3 2018-08-13 17:40:00      3  0.7441643
#> 2018-08-13 17:40:00.4 2018-08-13 17:40:00      4  0.5529621

^{Created on 2019-05-31 by the reprex package (v0.3.0)}

Subsetting data frame based on max number of instances of another column

Answers (1)

Related Questions