Complex ID Assignment with conditions, iterative calculations, and tolerance matching

Question

I am trying to write a decently complex iterative matching function but I am drowning in ifelse and for functions that do not work. Unfortunately I don't have anyone to bounce ideas off so any support or thoughts are appreciated.

My Data Structure

Each row of my data is an observation with many variables, pertinent ones are included in this example. The observation has an assigned Sample_Name, a Matching_Group corresponding to the sample name, measurements of Time, and a subjective Assigned_idx which is partially complete from earlier parts in the data cleaning. Each observed Sample_Name can contain 0-7 observations, but the Matching_Group will always contain 7 observations.

structure(list(Sample_Name = c("A", "A", "A", "A", "A", "B", "B", "B", 
"B", "B", "B", "QQ", "QQ", "QQ", "QQ", "QQ", "QQ", "QQ", "SS", 
"SS", "SS", "SS", "SS", "SS", "SS"), Matching_Group = c("QQ", 
"QQ", "QQ", "QQ", "QQ", "SS", "SS", "SS", "SS", "SS", "SS", "QQ", 
"QQ", "QQ", "QQ", "QQ", "QQ", "QQ", "SS", "SS", "SS", "SS", "SS", 
"SS", "SS"), Time = c(1, 1.1, 1.2, 1.4, 1.6, 7.203, 7.395, 
7.5, 7.6, 7.7, 7.802, 1, 1.102, 1.2, 1.3, 1.398, 1.501, 1.6, 
7.2, 7.3, 7.4, 7.5, 7.6, 7.7, 7.8), Assigned_idx = c(NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, 1, 2, 3, 4, 5, 6, 7, 1, 2, 
3, 4, 5, 6, 7)), row.names = c(NA, -25L), class = c("tbl_df", 
"tbl", "data.frame"))


Sample_Name Matching_Group  Time        Assigned_idx

A           QQ              1.000   
A           QQ              1.100   
A           QQ              1.200   
A           QQ              1.400   
A           QQ              1.600   
B           SS              7.203   
B           SS              7.395   
B           SS              7.500   
B           SS              7.600   
B           SS              7.700   
B           SS              7.802   
QQ          QQ              1.000       1
QQ          QQ              1.102       2
QQ          QQ              1.200       3
QQ          QQ              1.300       4
QQ          QQ              1.398       5
QQ          QQ              1.501       6
QQ          QQ              1.600       7
SS          SS              7.200       1
SS          SS              7.300       2
SS          SS              7.400       3
SS          SS              7.500       4
SS          SS              7.600       5
SS          SS              7.700       6
SS          SS              7.800       7

My Problem

For each observation (row), I want to calculate a ratio of Time between each row of the corresponding Matching_Group. Each Matching_Group will have an assigned unique Time_Ratio value that the calculation needs to be equal to +/- some tolerance. If that calculated ratio matches the pre-defined ratio specific to the group, I want to extract and assign the Assigned_idx from the row belonging to the observations of the Matching_Group and assign it to the observation. If not, repeat calculation with the same observed Time and the Time in the next row of Matching_Group. Repeat until every observation has a value in Assigned_idx.

Example: In this data set, for both Matching_Group the Time_Ratio should be equal to 1.000 +/- 0.0020. In my real data set, there will be unique Time_Ratio values per Matching_Group designated in a separate table. So for Row 3 with Time = 1.200, the Matching_Group is QQ. When we calculate the ratio with the first QQ observed time, 1.200/1.000 = 1.200 which is outside of our defined tolerance --> next observed time of QQ. 1.200/1.102 = 1.089...again outside of our tolerance. Finally though, 1.200/1.200 = 1.000 which indeed falls within our designated tolerance for this Matching_Group. In the row of the observation for Matching_Group that has the matching ratio, the Assigned_idx column holds 3. We take this value, and map it into the Assigned_idx column for Row 3. Then repeat this for Row 4 and iterate the process.

Desired outcome:

Sample_Name Matching_Group  Time        Assigned_idx    Time_Ratio (Sample:Matching) 

A           QQ              1.000       1               1.0000
A           QQ              1.100       2               0.9982
A           QQ              1.200       3               1.0000
A           QQ              1.400       5               1.0014
A           QQ              1.600       7               1.0000
B           SS              7.203       1               1.0004
B           SS              7.395       3               0.9993
B           SS              7.500       4               1.0000
B           SS              7.600       5               1.0000
B           SS              7.700       6               1.0000
B           SS              7.802       7               1.0003
QQ          QQ              1.000       1               1.0000
QQ          QQ              1.102       2               1.0000
QQ          QQ              1.200       3               1.0000
QQ          QQ              1.300       4               1.0000
QQ          QQ              1.398       5               1.0000
QQ          QQ              1.501       6               1.0000
QQ          QQ              1.600       7               1.0000
SS          SS              7.200       1               1.0000
SS          SS              7.300       2               1.0000
SS          SS              7.400       3               1.0000
SS          SS              7.500       4               1.0000
SS          SS              7.600       5               1.0000
SS          SS              7.700       6               1.0000
SS          SS              7.800       7               1.0000

I have tried approaching this using dplyr as I figured it should be able to handle what I am trying to accomplish (perhaps purrr is better suited?). Unfortunately I just can't seem to sequence my conditions and my expressions appropriately within ifelse and for functions. My attempts have included a mishmash of grouping %>% mutate with the ratio calculation, data.table::shift, etc. but I just can't seem to get it to work with my condition parameters. Also in case it is pertinent, in my real data there will be ~50 "Names" and ~25 matching groups. I will have a second data source listing the matching group names and respective ratio but did not include such detail in this example.

I'm honestly stumped, any thoughts are appreciated.

andrew_reece · Accepted Answer

UPDATE
First version was pretty clunky, here's a cleaner second pass:

library(tidyverse)
thresh <- .002
baseline <- 1.0

Still making compare, but now it's just two rows: one for each matching group, with times as a list of all the times per each Matching_Group:

compare <- df %>%
  filter(Sample_Name == Matching_Group) %>% 
  group_by(Matching_Group) %>%
  summarise(times = list(Time)) 

compare
  Matching_Group times    
               
1 QQ             
2 SS

Join df with compare, then use purrr::map() variants to get ratios, deltas (from baseline), and then the very handy detect_index() can give us the first match of sub-threshold ratio. (Note: This also solves the question from your comments about having thresh and baseline that are different for each matching group - we're still using static values here but the operations all assume those two variables are now columns in the df, which could in theory be different for each row or group.)

df %>%
  mutate(thresh = thresh,
         baseline = baseline) %>%
  inner_join(compare, by = "Matching_Group") %>%
  mutate(ratios = map2(Time, times, ~ .x / .y),
         deltas = map2(baseline, ratios, ~ abs(.x - .y)),
         Assigned_idx = map2_dbl(deltas, thresh, 
                                 ~detect_index(.x, ~ .x < .y, .y))) %>%
  select(-times, -ratios, -deltas)

Output:

   Sample_Name Matching_Group  Time Assigned_idx  thresh baseline
                                   
 1 A           QQ              1.00           1. 0.00200       1.
 2 A           QQ              1.10           2. 0.00200       1.
 3 A           QQ              1.20           3. 0.00200       1.
 4 A           QQ              1.40           5. 0.00200       1.
 5 A           QQ              1.60           7. 0.00200       1.
 6 B           SS              7.20           1. 0.00200       1.
 7 B           SS              7.40           3. 0.00200       1.
 8 B           SS              7.50           4. 0.00200       1.
 9 B           SS              7.60           5. 0.00200       1.
10 B           SS              7.70           6. 0.00200       1.
# ... with 15 more rows

Original solution

Here's a tidyverse solution. The idea is to swing out Sample_Name into wide form (that's compare), and then get the ratios for each row (and evaluate whether they pass the thresh test). Then it's just a matter of recombining and cleaning up unnecessary variables.

library(stringr)
library(tidyverse)

thresh <- .002
baseline <- 1.0

First, create df by adding name2 to data. It's just a copy of Sample_Name but with index values added:

df <- data %>%
  group_by(Sample_Name) %>%
  mutate(name2 = paste0(Sample_Name, 1:length(Sample_Name))) %>%
  ungroup() 

df
# A tibble: 25 x 5
   Sample_Name Matching_Group  Time Assigned_idx name2
                             
 1 A           QQ              1.00           NA A1   
 2 A           QQ              1.10           NA A2   
 3 A           QQ              1.20           NA A3   
 4 A           QQ              1.40           NA A4   
 5 A           QQ              1.60           NA A5   
 6 B           SS              7.20           NA B1 
 ...

Now create the compare data frame:

compare <- df %>% 
      select(name2, Time) %>%
      spread(name2, value = Time)

compare
# A tibble: 1 x 25
     A1    A2    A3    A4    A5    B1    B2    B3    B4    B5    B6   QQ1   QQ2
*             
1    1.  1.10  1.20  1.40  1.60  7.20  7.40  7.50  7.60  7.70  7.80    1.  1.10
# ... with 12 more variables: QQ3 , QQ4 , QQ5 , QQ6 ,
#   QQ7 , SS1 , SS2 , SS3 , SS4 , SS5 ,
#   SS6 , SS7

Use purrr:pmap to compute ratios and compare against thresh:

matched_df <- df %>%
  pmap(~ compare %>% 
         select(starts_with(..2)) %>% 
         mutate_all(funs(..3/., which(abs(baseline - ./..3 ) < thresh)[1])) %>%
         select(contains("_"))
       ) %>% 
  bind_rows(.) 

matched_df
# A tibble: 25 x 28
   `QQ1_/` `QQ2_/` `QQ3_/` `QQ4_/` `QQ5_/` `QQ6_/` `QQ7_/` `QQ1_[` `QQ2_[`
                             
 1    1.00   0.907   0.833   0.769   0.715   0.666   0.625       1      NA
 2    1.10   0.998   0.917   0.846   0.787   0.733   0.688      NA       1
 3    1.20   1.09    1.00    0.923   0.858   0.799   0.750      NA      NA
 4    1.40   1.27    1.17    1.08    1.00    0.933   0.875      NA      NA
 5    1.60   1.45    1.33    1.23    1.14    1.07    1.00       NA      NA

Finally, bind matched_df to df and clean up.
The key operation that narrows down to just the correctly matched indexes is filter(Assigned_idx == matched2). Up until that point, all of the possible ratios for each Sample_Name-to-Matching_Group assignment are present.

bind_cols(df, matched_df) %>%
  select(-name2, -Assigned_idx) %>%
  gather(Assigned_idx, value, -contains("/"), -Sample_Name, -Matching_Group, -Time) %>%
  filter(!is.na(value)) %>%
  gather(matched2, Time_Ratio, -Assigned_idx, -value, -Sample_Name, -Matching_Group, -Time) %>%
  mutate(Assigned_idx = str_replace(Assigned_idx, "_\[", ""),
         matched2 = str_replace(matched2, "_/", "")) %>%
  filter(Assigned_idx == matched2) %>%
  arrange(Sample_Name) %>%
  select(-value, -matched2) %>%
  mutate(Assigned_idx = str_sub(Assigned_idx, -1),
         Time_Ratio = round(Time_Ratio, 4))

       Sample_Name Matching_Group  Time Assigned_idx Time_Ratio
1            A             QQ 1.000            1     1.0000
2            A             QQ 1.100            2     0.9982
3            A             QQ 1.200            3     1.0000
4            A             QQ 1.400            5     1.0014
5            A             QQ 1.600            7     1.0000
6            B             SS 7.203            1     1.0004
7            B             SS 7.395            3     0.9993
8            B             SS 7.500            4     1.0000
...

Not my prettiest solution...for all the tidyverse wizards out there, happy to learn from any suggestions.

Data:

data <- structure(list(Sample_Name = c("A", "A", "A", "A", "A", "B", "B", "B", 
"B", "B", "B", "QQ", "QQ", "QQ", "QQ", "QQ", "QQ", "QQ", "SS", 
"SS", "SS", "SS", "SS", "SS", "SS"), Matching_Group = c("QQ", 
"QQ", "QQ", "QQ", "QQ", "SS", "SS", "SS", "SS", "SS", "SS", "QQ", 
"QQ", "QQ", "QQ", "QQ", "QQ", "QQ", "SS", "SS", "SS", "SS", "SS", 
"SS", "SS"), Time = c(1, 1.1, 1.2, 1.4, 1.6, 7.203, 7.395, 
7.5, 7.6, 7.7, 7.802, 1, 1.102, 1.2, 1.3, 1.398, 1.501, 1.6, 
7.2, 7.3, 7.4, 7.5, 7.6, 7.7, 7.8), Assigned_idx = c(NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, 1, 2, 3, 4, 5, 6, 7, 1, 2, 
3, 4, 5, 6, 7)), row.names = c(NA, -25L), class = c("tbl_df", 
"tbl", "data.frame"))

Complex ID Assignment with conditions, iterative calculations, and tolerance matching

My Data Structure

My Problem

Answers (2)

Related Questions