Reputation: 13
I am trying to write a decently complex iterative matching function but I am drowning in ifelse and for functions that do not work. Unfortunately I don't have anyone to bounce ideas off so any support or thoughts are appreciated.
Each row of my data is an observation with many variables, pertinent ones are included in this example. The observation has an assigned Sample_Name
, a Matching_Group
corresponding to the sample name, measurements of Time
, and a subjective Assigned_idx
which is partially complete from earlier parts in the data cleaning. Each observed Sample_Name
can contain 0-7 observations, but the Matching_Group
will always contain 7 observations.
structure(list(Sample_Name = c("A", "A", "A", "A", "A", "B", "B", "B",
"B", "B", "B", "QQ", "QQ", "QQ", "QQ", "QQ", "QQ", "QQ", "SS",
"SS", "SS", "SS", "SS", "SS", "SS"), Matching_Group = c("QQ",
"QQ", "QQ", "QQ", "QQ", "SS", "SS", "SS", "SS", "SS", "SS", "QQ",
"QQ", "QQ", "QQ", "QQ", "QQ", "QQ", "SS", "SS", "SS", "SS", "SS",
"SS", "SS"), Time = c(1, 1.1, 1.2, 1.4, 1.6, 7.203, 7.395,
7.5, 7.6, 7.7, 7.802, 1, 1.102, 1.2, 1.3, 1.398, 1.501, 1.6,
7.2, 7.3, 7.4, 7.5, 7.6, 7.7, 7.8), Assigned_idx = c(NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, 1, 2, 3, 4, 5, 6, 7, 1, 2,
3, 4, 5, 6, 7)), row.names = c(NA, -25L), class = c("tbl_df",
"tbl", "data.frame"))
Sample_Name Matching_Group Time Assigned_idx
A QQ 1.000
A QQ 1.100
A QQ 1.200
A QQ 1.400
A QQ 1.600
B SS 7.203
B SS 7.395
B SS 7.500
B SS 7.600
B SS 7.700
B SS 7.802
QQ QQ 1.000 1
QQ QQ 1.102 2
QQ QQ 1.200 3
QQ QQ 1.300 4
QQ QQ 1.398 5
QQ QQ 1.501 6
QQ QQ 1.600 7
SS SS 7.200 1
SS SS 7.300 2
SS SS 7.400 3
SS SS 7.500 4
SS SS 7.600 5
SS SS 7.700 6
SS SS 7.800 7
For each observation (row), I want to calculate a ratio of Time
between each row of the corresponding Matching_Group
. Each Matching_Group
will have an assigned unique Time_Ratio
value that the calculation needs to be equal to +/- some tolerance. If that calculated ratio matches the pre-defined ratio specific to the group, I want to extract and assign the Assigned_idx
from the row belonging to the observations of the Matching_Group
and assign it to the observation. If not, repeat calculation with the same observed Time
and the Time
in the next row of Matching_Group
. Repeat until every observation has a value in Assigned_idx
.
Example: In this data set, for both Matching_Group
the Time_Ratio
should be equal to 1.000 +/- 0.0020
. In my real data set, there will be unique Time_Ratio
values per Matching_Group
designated in a separate table. So for Row 3 with Time
= 1.200
, the Matching_Group
is QQ
. When we calculate the ratio with the first QQ
observed time, 1.200/1.000 = 1.200
which is outside of our defined tolerance --> next observed time of QQ
. 1.200/1.102 = 1.089
...again outside of our tolerance. Finally though, 1.200/1.200 = 1.000
which indeed falls within our designated tolerance for this Matching_Group
. In the row of the observation for Matching_Group
that has the matching ratio, the Assigned_idx
column holds 3
. We take this value, and map it into the Assigned_idx
column for Row 3. Then repeat this for Row 4 and iterate the process.
Desired outcome:
Sample_Name Matching_Group Time Assigned_idx Time_Ratio (Sample:Matching)
A QQ 1.000 1 1.0000
A QQ 1.100 2 0.9982
A QQ 1.200 3 1.0000
A QQ 1.400 5 1.0014
A QQ 1.600 7 1.0000
B SS 7.203 1 1.0004
B SS 7.395 3 0.9993
B SS 7.500 4 1.0000
B SS 7.600 5 1.0000
B SS 7.700 6 1.0000
B SS 7.802 7 1.0003
QQ QQ 1.000 1 1.0000
QQ QQ 1.102 2 1.0000
QQ QQ 1.200 3 1.0000
QQ QQ 1.300 4 1.0000
QQ QQ 1.398 5 1.0000
QQ QQ 1.501 6 1.0000
QQ QQ 1.600 7 1.0000
SS SS 7.200 1 1.0000
SS SS 7.300 2 1.0000
SS SS 7.400 3 1.0000
SS SS 7.500 4 1.0000
SS SS 7.600 5 1.0000
SS SS 7.700 6 1.0000
SS SS 7.800 7 1.0000
I have tried approaching this using dplyr as I figured it should be able to handle what I am trying to accomplish (perhaps purrr is better suited?). Unfortunately I just can't seem to sequence my conditions and my expressions appropriately within ifelse and for functions. My attempts have included a mishmash of grouping %>% mutate with the ratio calculation, data.table::shift, etc. but I just can't seem to get it to work with my condition parameters. Also in case it is pertinent, in my real data there will be ~50 "Names" and ~25 matching groups. I will have a second data source listing the matching group names and respective ratio but did not include such detail in this example.
I'm honestly stumped, any thoughts are appreciated.
Upvotes: 1
Views: 117
Reputation: 21274
UPDATE
First version was pretty clunky, here's a cleaner second pass:
library(tidyverse)
thresh <- .002
baseline <- 1.0
Still making compare
, but now it's just two rows: one for each matching group, with times
as a list of all the times per each Matching_Group
:
compare <- df %>%
filter(Sample_Name == Matching_Group) %>%
group_by(Matching_Group) %>%
summarise(times = list(Time))
compare
Matching_Group times
<chr> <list>
1 QQ <dbl [7]>
2 SS <dbl [7]>
Join df
with compare
, then use purrr::map()
variants to get ratios, deltas (from baseline), and then the very handy detect_index()
can give us the first match of sub-threshold ratio. (Note: This also solves the question from your comments about having thresh
and baseline
that are different for each matching group - we're still using static values here but the operations all assume those two variables are now columns in the df, which could in theory be different for each row or group.)
df %>%
mutate(thresh = thresh,
baseline = baseline) %>%
inner_join(compare, by = "Matching_Group") %>%
mutate(ratios = map2(Time, times, ~ .x / .y),
deltas = map2(baseline, ratios, ~ abs(.x - .y)),
Assigned_idx = map2_dbl(deltas, thresh,
~detect_index(.x, ~ .x < .y, .y))) %>%
select(-times, -ratios, -deltas)
Output:
Sample_Name Matching_Group Time Assigned_idx thresh baseline
<chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 A QQ 1.00 1. 0.00200 1.
2 A QQ 1.10 2. 0.00200 1.
3 A QQ 1.20 3. 0.00200 1.
4 A QQ 1.40 5. 0.00200 1.
5 A QQ 1.60 7. 0.00200 1.
6 B SS 7.20 1. 0.00200 1.
7 B SS 7.40 3. 0.00200 1.
8 B SS 7.50 4. 0.00200 1.
9 B SS 7.60 5. 0.00200 1.
10 B SS 7.70 6. 0.00200 1.
# ... with 15 more rows
Original solution
Here's a tidyverse
solution. The idea is to swing out Sample_Name
into wide form (that's compare
), and then get the ratios for each row (and evaluate whether they pass the thresh
test). Then it's just a matter of recombining and cleaning up unnecessary variables.
library(stringr)
library(tidyverse)
thresh <- .002
baseline <- 1.0
First, create df
by adding name2
to data
. It's just a copy of Sample_Name
but with index values added:
df <- data %>%
group_by(Sample_Name) %>%
mutate(name2 = paste0(Sample_Name, 1:length(Sample_Name))) %>%
ungroup()
df
# A tibble: 25 x 5
Sample_Name Matching_Group Time Assigned_idx name2
<chr> <chr> <dbl> <dbl> <chr>
1 A QQ 1.00 NA A1
2 A QQ 1.10 NA A2
3 A QQ 1.20 NA A3
4 A QQ 1.40 NA A4
5 A QQ 1.60 NA A5
6 B SS 7.20 NA B1
...
Now create the compare
data frame:
compare <- df %>%
select(name2, Time) %>%
spread(name2, value = Time)
compare
# A tibble: 1 x 25
A1 A2 A3 A4 A5 B1 B2 B3 B4 B5 B6 QQ1 QQ2
* <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1. 1.10 1.20 1.40 1.60 7.20 7.40 7.50 7.60 7.70 7.80 1. 1.10
# ... with 12 more variables: QQ3 <dbl>, QQ4 <dbl>, QQ5 <dbl>, QQ6 <dbl>,
# QQ7 <dbl>, SS1 <dbl>, SS2 <dbl>, SS3 <dbl>, SS4 <dbl>, SS5 <dbl>,
# SS6 <dbl>, SS7 <dbl>
Use purrr:pmap
to compute ratios and compare against thresh
:
matched_df <- df %>%
pmap(~ compare %>%
select(starts_with(..2)) %>%
mutate_all(funs(..3/., which(abs(baseline - ./..3 ) < thresh)[1])) %>%
select(contains("_"))
) %>%
bind_rows(.)
matched_df
# A tibble: 25 x 28
`QQ1_/` `QQ2_/` `QQ3_/` `QQ4_/` `QQ5_/` `QQ6_/` `QQ7_/` `QQ1_[` `QQ2_[`
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int> <int>
1 1.00 0.907 0.833 0.769 0.715 0.666 0.625 1 NA
2 1.10 0.998 0.917 0.846 0.787 0.733 0.688 NA 1
3 1.20 1.09 1.00 0.923 0.858 0.799 0.750 NA NA
4 1.40 1.27 1.17 1.08 1.00 0.933 0.875 NA NA
5 1.60 1.45 1.33 1.23 1.14 1.07 1.00 NA NA
Finally, bind matched_df
to df
and clean up.
The key operation that narrows down to just the correctly matched indexes is filter(Assigned_idx == matched2)
. Up until that point, all of the possible ratios for each Sample_Name
-to-Matching_Group
assignment are present.
bind_cols(df, matched_df) %>%
select(-name2, -Assigned_idx) %>%
gather(Assigned_idx, value, -contains("/"), -Sample_Name, -Matching_Group, -Time) %>%
filter(!is.na(value)) %>%
gather(matched2, Time_Ratio, -Assigned_idx, -value, -Sample_Name, -Matching_Group, -Time) %>%
mutate(Assigned_idx = str_replace(Assigned_idx, "_\\[", ""),
matched2 = str_replace(matched2, "_/", "")) %>%
filter(Assigned_idx == matched2) %>%
arrange(Sample_Name) %>%
select(-value, -matched2) %>%
mutate(Assigned_idx = str_sub(Assigned_idx, -1),
Time_Ratio = round(Time_Ratio, 4))
Sample_Name Matching_Group Time Assigned_idx Time_Ratio
1 A QQ 1.000 1 1.0000
2 A QQ 1.100 2 0.9982
3 A QQ 1.200 3 1.0000
4 A QQ 1.400 5 1.0014
5 A QQ 1.600 7 1.0000
6 B SS 7.203 1 1.0004
7 B SS 7.395 3 0.9993
8 B SS 7.500 4 1.0000
...
Not my prettiest solution...for all the tidyverse
wizards out there, happy to learn from any suggestions.
Data:
data <- structure(list(Sample_Name = c("A", "A", "A", "A", "A", "B", "B", "B",
"B", "B", "B", "QQ", "QQ", "QQ", "QQ", "QQ", "QQ", "QQ", "SS",
"SS", "SS", "SS", "SS", "SS", "SS"), Matching_Group = c("QQ",
"QQ", "QQ", "QQ", "QQ", "SS", "SS", "SS", "SS", "SS", "SS", "QQ",
"QQ", "QQ", "QQ", "QQ", "QQ", "QQ", "SS", "SS", "SS", "SS", "SS",
"SS", "SS"), Time = c(1, 1.1, 1.2, 1.4, 1.6, 7.203, 7.395,
7.5, 7.6, 7.7, 7.802, 1, 1.102, 1.2, 1.3, 1.398, 1.501, 1.6,
7.2, 7.3, 7.4, 7.5, 7.6, 7.7, 7.8), Assigned_idx = c(NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, 1, 2, 3, 4, 5, 6, 7, 1, 2,
3, 4, 5, 6, 7)), row.names = c(NA, -25L), class = c("tbl_df",
"tbl", "data.frame"))
Upvotes: 0
Reputation: 446
Something like this should work:
#!/usr/bin/R
a = structure(list(Sample_Name = c("A", "A", "A", "A", "A", "B", "B", "B",
"B", "B", "B", "QQ", "QQ", "QQ", "QQ", "QQ", "QQ", "QQ", "SS",
"SS", "SS", "SS", "SS", "SS", "SS"), Matching_Group = c("QQ",
"QQ", "QQ", "QQ", "QQ", "SS", "SS", "SS", "SS", "SS", "SS", "QQ",
"QQ", "QQ", "QQ", "QQ", "QQ", "QQ", "SS", "SS", "SS", "SS", "SS",
"SS", "SS"), Time = c(1, 1.1, 1.2, 1.4, 1.6, 7.203, 7.395,
7.5, 7.6, 7.7, 7.802, 1, 1.102, 1.2, 1.3, 1.398, 1.501, 1.6,
7.2, 7.3, 7.4, 7.5, 7.6, 7.7, 7.8), Assigned_idx = c(NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, 1, 2, 3, 4, 5, 6, 7, 1, 2,
3, 4, 5, 6, 7)), row.names = c(NA, -25L), class = c("tbl_df",
"tbl", "data.frame"));
tol = 0.002;
a$Time_Ratio <- NA;
for (i in 1:nrow(a)) {
s_name <- a[i, "Sample_Name"];
mg <- a[i, "Matching_Group"];
s_time <- a[i, "Time"];
for (j in 1:nrow(a)) {
mg_name <- a[j, "Sample_Name"];
if (mg_name == mg) {
mg_time <- a[j, "Time"];
time_ratio = s_time/mg_time;
if (abs(time_ratio - 1.0) < tol) {
a[i, "Assigned_idx"] <- a[j, "Assigned_idx"];
a[i, "Time_Ratio"] <- time_ratio;
break;
}
}
}
}
print(a);
Upvotes: 0