chopin_is_the_best
chopin_is_the_best

Reputation: 2101

Compare strings in dplyr

My dataframe looks like:

df <- tibble::tribble(
  ~order_id, ~user_id,      ~comp_order,           ~comp_rec,
    1164320,    32924, "4-6-22-11-37-5", "4-5-6-11-22-36-37",
    1169182,    33128,  "9-4-15-28-8-7",  "4-7-8-9-28-37-38",
    1166014,    33003,    "27-22-4-6-5", "4-5-6-22-27-36-37",
    1166019,    32996,    "27-22-4-6-8", "4-6-8-22-27-36-38"
  )

I want to know what digit is present in the comp_order column and not in the comp_rec.

The final output should look like:

  order_id user_id comp_rec          comp_order     is_equal elements_removed_from_rec elements_added_to_order
     <dbl>   <dbl> <chr>             <chr>          <chr>    <chr>                     <chr>                  
1  1164320   32924 4-5-6-11-22-36-37 4-6-22-11-37-5 no       36                        none                   
2  1169182   33128 4-7-8-9-28-37-38  9-4-15-28-8-7  no       37                        none                   
3  1166014   33003 4-5-6-22-27-36-37 27-22-4-6-5    no       36-37                     none                   
4  1166019   32996 4-6-8-22-27-36-38 27-22-4-6-8    no       36-38                     none                   
5  1166012   32922 27-22-4-6-8       27-22-4-6-8    yes      none                      none                   
6  1166033   32911 27-22-4-6-8       27-22-4-6-8-33 no       none                      33                     

df_output <- tibble::tribble(
               ~order_id, ~user_id,           ~comp_rec,      ~comp_order, ~is_equal, ~elements_removed_from_rec, ~elements_added_to_order,
                 1164320,    32924, "4-5-6-11-22-36-37", "4-6-22-11-37-5",      "no",                       "36",                   "none",
                 1169182,    33128,  "4-7-8-9-28-37-38",  "9-4-15-28-8-7",      "no",                       "37",                   "none",
                 1166014,    33003, "4-5-6-22-27-36-37",    "27-22-4-6-5",      "no",                    "36-37",                   "none",
                 1166019,    32996, "4-6-8-22-27-36-38",    "27-22-4-6-8",      "no",                    "36-38",                   "none",
                 1166012,    32922,       "27-22-4-6-8",    "27-22-4-6-8",     "yes",                     "none",                   "none",
                 1166033,    32911,       "27-22-4-6-8", "27-22-4-6-8-33",      "no",                     "none",                     "33"
               )

I would need to know:

  1. what has been removed from rec
  2. what has been added to order

In terms of digits in the string.

The issues that the order of the digits in the string is not necessarily the same...

How can I make a comparison between these 2 strings?

Upvotes: 3

Views: 1112

Answers (2)

Ian Campbell
Ian Campbell

Reputation: 24790

Here's an approach with purrr:

First, we use purrr:map to split the elements on - into a list of elements. Then, we use purrr:map2 to perform setdiff on the lists to identify the different elements.

If both evaluate to "", then we know they are the same, and so we can use case_when to determine the is_equal column.

Then we can clean up by removing the list columns.

library(dplyr)
library(purrr)
df %>%
  mutate(comp_order_list = map(comp_order, ~str_split(.,"-", simplify = TRUE)),
         comp_rec_list = map(comp_rec, ~str_split(.,"-", simplify = TRUE)),
         elements_removed = map2_chr(comp_rec_list,comp_order_list,
                                   ~ paste(setdiff(.x,.y),collapse = "-")),
         elements_added = map2_chr(comp_order_list,comp_rec_list,
                                     ~ paste(setdiff(.x,.y),collapse = "-")),
         is_equal = case_when(elements_removed == "" & elements_added == "" ~ "yes",
                              TRUE ~ "no")) %>%
  dplyr::select(order_id,user_id,comp_rec,comp_order,is_equal,elements_removed,elements_added)
# A tibble: 6 x 7
  order_id user_id comp_rec          comp_order     is_equal elements_removed elements_added
     <dbl>   <dbl> <chr>             <chr>          <chr>    <chr>            <chr>         
1  1164320   32924 4-5-6-11-22-36-37 4-6-22-11-37-5 no       "36"             ""            
2  1169182   33128 4-7-8-9-28-37-38  9-4-15-28-8-7  no       "37-38"          "15"          
3  1166014   33003 4-5-6-22-27-36-37 27-22-4-6-5    no       "36-37"          ""            
4  1166019   32996 4-6-8-22-27-36-38 27-22-4-6-8    no       "36-38"          ""            
5  1166012   32922 27-22-4-6-8       27-22-4-6-8    yes      ""               ""            
6  1166033   32911 27-22-4-6-8       27-22-4-6-8-33 no       ""               "33"    

Upvotes: 3

coffeinjunky
coffeinjunky

Reputation: 11514

Here one way of doing it that utilises setdiff at its core:

# These two lines split the strings and store them in one cell in the dataframe
df$digits_order <- str_split(df$comp_order, "-")
df$digits_rec <- str_split(df$comp_rec, "-")

# The following apply function iterates over rows and applies setdiff. 
# paste is used to stitch them together if apply returns multiple digits.  
df$in_rec_but_not_order <- apply(df, 1, function(row) paste(setdiff(row$digits_rec, row$digits_order), collapse = "-"))
df$in_order_but_not_rec <- apply(df, 1, function(row) paste(setdiff(row$digits_order, row$digits_rec), collapse = "-"))

df
# A tibble: 4 x 8
  order_id user_id comp_order     comp_rec          digits_order digits_rec in_order_but_not_rec in_rec_but_not_order
     <dbl>   <dbl> <chr>          <chr>             <list>       <list>     <chr>                <chr>               
1  1164320   32924 4-6-22-11-37-5 4-5-6-11-22-36-37 <chr [6]>    <chr [7]>  ""                   36                  
2  1169182   33128 9-4-15-28-8-7  4-7-8-9-28-37-38  <chr [6]>    <chr [7]>  15                   37-38               
3  1166014   33003 27-22-4-6-5    4-5-6-22-27-36-37 <chr [5]>    <chr [7]>  ""                   36-37               
4  1166019   32996 27-22-4-6-8    4-6-8-22-27-36-38 <chr [5]>    <chr [7]>  ""                   36-38               

Here, the last column contains the digits missing from the record.

Upvotes: 2

Related Questions