Compare strings in dplyr

Question

My dataframe looks like:

df <- tibble::tribble(
  ~order_id, ~user_id,      ~comp_order,           ~comp_rec,
    1164320,    32924, "4-6-22-11-37-5", "4-5-6-11-22-36-37",
    1169182,    33128,  "9-4-15-28-8-7",  "4-7-8-9-28-37-38",
    1166014,    33003,    "27-22-4-6-5", "4-5-6-22-27-36-37",
    1166019,    32996,    "27-22-4-6-8", "4-6-8-22-27-36-38"
  )

I want to know what digit is present in the comp_order column and not in the comp_rec.

The final output should look like:

  order_id user_id comp_rec          comp_order     is_equal elements_removed_from_rec elements_added_to_order
                                                                           
1  1164320   32924 4-5-6-11-22-36-37 4-6-22-11-37-5 no       36                        none                   
2  1169182   33128 4-7-8-9-28-37-38  9-4-15-28-8-7  no       37                        none                   
3  1166014   33003 4-5-6-22-27-36-37 27-22-4-6-5    no       36-37                     none                   
4  1166019   32996 4-6-8-22-27-36-38 27-22-4-6-8    no       36-38                     none                   
5  1166012   32922 27-22-4-6-8       27-22-4-6-8    yes      none                      none                   
6  1166033   32911 27-22-4-6-8       27-22-4-6-8-33 no       none                      33                     

df_output <- tibble::tribble(
               ~order_id, ~user_id,           ~comp_rec,      ~comp_order, ~is_equal, ~elements_removed_from_rec, ~elements_added_to_order,
                 1164320,    32924, "4-5-6-11-22-36-37", "4-6-22-11-37-5",      "no",                       "36",                   "none",
                 1169182,    33128,  "4-7-8-9-28-37-38",  "9-4-15-28-8-7",      "no",                       "37",                   "none",
                 1166014,    33003, "4-5-6-22-27-36-37",    "27-22-4-6-5",      "no",                    "36-37",                   "none",
                 1166019,    32996, "4-6-8-22-27-36-38",    "27-22-4-6-8",      "no",                    "36-38",                   "none",
                 1166012,    32922,       "27-22-4-6-8",    "27-22-4-6-8",     "yes",                     "none",                   "none",
                 1166033,    32911,       "27-22-4-6-8", "27-22-4-6-8-33",      "no",                     "none",                     "33"
               )

I would need to know:

what has been removed from rec
what has been added to order

In terms of digits in the string.

The issues that the order of the digits in the string is not necessarily the same...

How can I make a comparison between these 2 strings?

Ian Campbell · Accepted Answer

Here's an approach with purrr:

First, we use purrr:map to split the elements on - into a list of elements. Then, we use purrr:map2 to perform setdiff on the lists to identify the different elements.

If both evaluate to "", then we know they are the same, and so we can use case_when to determine the is_equal column.

Then we can clean up by removing the list columns.

library(dplyr)
library(purrr)
df %>%
  mutate(comp_order_list = map(comp_order, ~str_split(.,"-", simplify = TRUE)),
         comp_rec_list = map(comp_rec, ~str_split(.,"-", simplify = TRUE)),
         elements_removed = map2_chr(comp_rec_list,comp_order_list,
                                   ~ paste(setdiff(.x,.y),collapse = "-")),
         elements_added = map2_chr(comp_order_list,comp_rec_list,
                                     ~ paste(setdiff(.x,.y),collapse = "-")),
         is_equal = case_when(elements_removed == "" & elements_added == "" ~ "yes",
                              TRUE ~ "no")) %>%
  dplyr::select(order_id,user_id,comp_rec,comp_order,is_equal,elements_removed,elements_added)
# A tibble: 6 x 7
  order_id user_id comp_rec          comp_order     is_equal elements_removed elements_added
                                                         
1  1164320   32924 4-5-6-11-22-36-37 4-6-22-11-37-5 no       "36"             ""            
2  1169182   33128 4-7-8-9-28-37-38  9-4-15-28-8-7  no       "37-38"          "15"          
3  1166014   33003 4-5-6-22-27-36-37 27-22-4-6-5    no       "36-37"          ""            
4  1166019   32996 4-6-8-22-27-36-38 27-22-4-6-8    no       "36-38"          ""            
5  1166012   32922 27-22-4-6-8       27-22-4-6-8    yes      ""               ""            
6  1166033   32911 27-22-4-6-8       27-22-4-6-8-33 no       ""               "33"

Compare strings in dplyr

Answers (2)

Related Questions