Rfanatic
Rfanatic

Reputation: 2282

How to match values based on column?

I would like to identify the length of measurement was taken during a specific day based on id and id1. The measurement length is recorded in the d1...d5 variables and d_day defines the length of the measurement I am looking for. For example in the first case, I am after d3 measurement length that equals 30. I tried the which functions but sure there are faster solutions.

Desired output

enter image description here

Data structure:

enter image description here

Sample data:

structure(list(id = c(101, 101), id_1 = c(1, 2), d1 = c(15, 0
), d2 = c(30, 15), d3 = c(25, 30), d4 = c(15, 30), d5 = c(15, 
60), d_day = c("d3", "d1")), class = c("spec_tbl_df", "tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -2L), spec = structure(list(
    cols = list(id = structure(list(), class = c("collector_double", 
    "collector")), id_1 = structure(list(), class = c("collector_double", 
    "collector")), d1 = structure(list(), class = c("collector_double", 
    "collector")), d2 = structure(list(), class = c("collector_double", 
    "collector")), d3 = structure(list(), class = c("collector_double", 
    "collector")), d4 = structure(list(), class = c("collector_double", 
    "collector")), d5 = structure(list(), class = c("collector_double", 
    "collector")), d_day = structure(list(), class = c("collector_character", 
    "collector"))), default = structure(list(), class = c("collector_guess", 
    "collector")), skip = 1L), class = "col_spec"))

Upvotes: 0

Views: 53

Answers (3)

GKi
GKi

Reputation: 39657

As your d_day contains already the column number it could directly used for subsetting which should be faster than using match.

cbind(x[1:2], d_day_h = x[3:7][cbind(seq_len(nrow(x)),
  as.numeric(substring(x$d_day, 2)))])
#   id id_1 d_day_h
#1 101    1      25
#2 101    2       0

Benchmark:

library(dplyr)
library(tidyr)
bench::mark(
   Gki = {cbind(x[1:2], d_day_h = as.data.frame(x[3:7])[cbind(seq_len(nrow(x)), as.numeric(substring(x$d_day, 2)))])}
 , KarthikS = {as.data.frame(x %>% pivot_longer(-c(id, id_1, d_day)) %>% filter(d_day == name) %>% 
                                     select(id, id_1, 'd_day_h' = value))}
 , GuedesBF = {as.data.frame(x %>% 
        rowwise %>%
        mutate(d_day_h = get(d_day)) %>%
        select(-c(d1:d_day))%>%
        ungroup)}
       )
#  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
#  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
#1 Gki        288.74µs 306.78µs     3198.        0B     12.5  1538     6
#2 KarthikS     9.19ms   9.57ms      101.    25.1KB     11.2    45     5
#3 GuedesBF        6ms   6.52ms      151.    13.5KB     10.9    69     5

Currently method of GKi is about 20 times faster and uses less memory than the method form GuedesBF and about 30 times faster than the method from KarthikS.

Upvotes: 1

GuedesBF
GuedesBF

Reputation: 9858

A very simple approach with get():

library(dplyr)

df %>% 
        rowwise %>%
        mutate(d_day_h = get(d_day)) %>%
        select(-c(d1:d_day))%>%
        ungroup

# A tibble: 2 x 3
     id  id_1 d_day_h
  <dbl> <dbl>   <dbl>
1   101     1      25
2   101     2       0

Upvotes: 1

Karthik S
Karthik S

Reputation: 11584

Does this work:

library(dplyr)
library(tidyr)
df %>% pivot_longer(-c(id, id_1, d_day)) %>% filter(d_day == name) %>% 
     select(id, id_1, 'd_day_h' = value)
# A tibble: 2 x 3
     id  id_1 d_day_h
  <dbl> <dbl>   <dbl>
1   101     1      25
2   101     2       0

Upvotes: 1

Related Questions