user11352627

Reputation: 127

Match strings before special character

I am trying to compare strings in two columns and return the mismatches, comparing only the part before ":". Entries such as x2x or y67y should not be returned, because x remains x and y remains y.

I don't want to compare the ":decimal" part. If x2y is in both columns, then it's a match (irrespective of any mismatch in the decimal after the special character).

Input:

input <- structure(list(x = structure(c(1L, 2L, 3L, 3L), .Label = c("A", 
"B", "C"), class = "factor"), y = structure(c(2L, 3L, 1L, 4L), .Label = c("A", 
"B", "C", "D"), class = "factor"), x_val = c("x2x:0.12345,y67h:0.06732,d7j:0.032647", 
"x2y:0.26345,y67y:0.28320,d7r:0.043647", "x2y:0.23435,y67y:0.28310,d7r:0.043547", 
"x2y:0.23435,y67y:0.28330,d7r:0.043247"), y_val = c("x2y:0.33134,y67y:0.3131,d7r:0.23443", 
"x2y:0.34311,y67y:0.14142,d7r:0.31431", "x2x:0.34314,y67h:0.14141,d7j:0.453145", 
"x67b:0.31411,g72v:0.3134,b8c:0.89234")), row.names = c(NA, -4L
), class = "data.frame")

Output:

output <- structure(list(x = structure(c(1L, 2L, 3L, 3L), .Label = c("A", 
"B", "C"), class = "factor"), y = structure(c(2L, 3L, 1L, 4L), .Label = c("A", 
"B", "C", "D"), class = "factor"), x_val = c("x2x:0.12345,y67h:0.06732,d7j:0.032647", 
"x2y:0.26345,y67y:0.28320,d7r:0.043647", "x2y:0.23435,y67y:0.28310,d7r:0.043547", 
"x2y:0.23435,y67y:0.28330,d7r:0.043247"), y_val = c("x2y:0.33134,y67y:0.3131,d7r:0.23443", 
"x2y:0.34311,y67y:0.14142,d7r:0.31431", "x2x:0.34314,y67h:0.14141,d7j:0.453145", 
"x67b:0.31411,g72v:0.3134,b8c:0.89234"), diff_x = c("y67h:0.06732,d7j:0.03264", 
NA, "x2y:0.23435,d7r:0.043547", "x2y:0.23435,y67y:0.28330,d7r:0.043247"
), diff_y = c("x2y:0.33134,d7r:0.23443", NA, "y67h:0.14141,d7j:0.453145", 
"x67b:0.31411,g72v:0.3134,b8c:0.89234")), row.names = c(NA, -4L
), class = "data.frame")

I run into a problem when I only want to match up to the ":" character. The following code is taken from this answer: https://stackoverflow.com/a/55285959/5150629.

library(dplyr)
library(purrr)

input %>% mutate(diff_x = map2_chr(strsplit(x_val, split = ", "),
                                   strsplit(y_val, split = ", "),
                                   ~paste(grep('([a-z])(?>\\d+)(?!\\1)', setdiff(.x, .y),
                                               value = TRUE, perl = TRUE),
                                          collapse = ", ")) %>%
                   replace(. == "", NA),
                 diff_y = map2_chr(strsplit(x_val, split = ", "),
                                   strsplit(y_val, split = ", "),
                                   ~paste(grep('([a-z])(?>\\d+)(?!\\1)', setdiff(.y, .x),
                                               value = TRUE, perl = TRUE),
                                          collapse = ", ")) %>%
                   replace(. == "", NA))

Can anyone help? Thanks!

Upvotes: 1

Views: 90

Answers (1)

acylam

Reputation: 18681

I modified my answer in https://stackoverflow.com/a/55285959/5150629 to fit this question:

library(dplyr)
library(purrr)

df <- input  # the data frame from the question

df %>%
  mutate(
    diff_x = map2_chr(
      strsplit(x_val, split = ","), 
      strsplit(y_val, split = ","), 
      ~ {
        setdiff(sub(":.+$", "", .x), sub(":.+$", "", .y)) %>%
          grep('([a-z])(?>\\d+)(?!\\1)', ., value = TRUE, perl = TRUE) %>%
          sapply(grep, .x, value = TRUE) %>%
          paste(collapse = ", ") %>%
          replace(. == "", NA)
      }
    ),  
    diff_y = map2_chr(
      strsplit(x_val, split = ","), 
      strsplit(y_val, split = ","), 
      ~ {
        setdiff(sub(":.+$", "", .y), sub(":.+$", "", .x)) %>%
          grep('([a-z])(?>\\d+)(?!\\1)', ., value = TRUE, perl = TRUE) %>%
          sapply(grep, .y, value = TRUE) %>%
          paste(collapse = ", ") %>%
          replace(. == "", NA)
      }
    )
  )

Output:

  x y                                 x_val                                 y_val                     diff_x
1 A B x2x:0.12345,y67h:0.06732,d7j:0.032647   x2y:0.33134,y67y:0.3131,d7r:0.23443 y67h:0.06732, d7j:0.032647
2 B C x2y:0.26345,y67y:0.28320,d7r:0.043647  x2y:0.34311,y67y:0.14142,d7r:0.31431                       <NA>
3 C A x2y:0.23435,y67y:0.28310,d7r:0.043547 x2x:0.34314,y67h:0.14141,d7j:0.453145  x2y:0.23435, d7r:0.043547
4 C D x2y:0.23435,y67y:0.28330,d7r:0.043247  x67b:0.31411,g72v:0.3134,b8c:0.89234  x2y:0.23435, d7r:0.043247
                                  diff_y
1               x2y:0.33134, d7r:0.23443
2                                   <NA>
3             y67h:0.14141, d7j:0.453145
4 x67b:0.31411, g72v:0.3134, b8c:0.89234

Notes:

  1. Since we are only interested in comparing the first part of the string format x1y:000000, I added a sub(":.+$", "", .x) for each map2_chr input argument to strip out the :000000 part first.

  2. setdiff and the following grep steps work as expected to return the mismatches and exclude strings with the form x1x.

  3. sapply(grep, .x, value = TRUE) after the first grep takes the vector of mismatches, and searches for their corresponding original strings (in x1y:000000 form).

  4. paste collapses the vector of mismatches into a single comma-separated string, and replace turns an empty result into NA.
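
As a rough illustration of the four notes, the steps can be traced in base R for the first row of the question's data. This is only a sketch: the variable names (x, y, x_keys, y_keys, only_x, changed, entries) are made up for the walk-through and do not appear in the code above.

```r
# Hypothetical walk-through of notes 1-4 for row 1 of the question's data.
# All variable names are illustrative only.
x <- strsplit("x2x:0.12345,y67h:0.06732,d7j:0.032647", split = ",")[[1]]
y <- strsplit("x2y:0.33134,y67y:0.3131,d7r:0.23443", split = ",")[[1]]

# Note 1: strip the ":decimal" part so only the keys are compared.
x_keys <- sub(":.+$", "", x)  # "x2x" "y67h" "d7j"
y_keys <- sub(":.+$", "", y)  # "x2y" "y67y" "d7r"

# Note 2: keys that occur in x but not in y ...
only_x <- setdiff(x_keys, y_keys)  # "x2x" "y67h" "d7j"
# ... keeping only those whose first letter does not reappear after the digits.
changed <- grep("([a-z])(?>\\d+)(?!\\1)", only_x, value = TRUE, perl = TRUE)
# "y67h" "d7j"

# Note 3: look up the original "key:decimal" entries for the remaining keys.
entries <- unname(sapply(changed, grep, x, value = TRUE))
# "y67h:0.06732" "d7j:0.032647"

# Note 4: collapse into one string; an empty result becomes NA.
diff_x <- paste(entries, collapse = ", ")
diff_x <- replace(diff_x, diff_x == "", NA)
print(diff_x)
# [1] "y67h:0.06732, d7j:0.032647"
```

One thing to keep in mind: in note 3 each key is used as a regular expression, so a key could in principle match more than one entry (d7j would also match a hypothetical entry such as ad7j:0.1). If that is a concern, a positional lookup like x[x_keys %in% changed] should give the same entries for this data without any pattern matching.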

Upvotes: 2
