Split string by position and numeric value in dplyr

Question

I'm attempting read in another group's data and there's one column where a pair of two numbers is usually separated by a comma, but in ~10% of cases (>15,000) there is no comma. It's too many to separate by hand, but I'm having trouble devising an efficient way to separate the string efficiently. Here are the rules:

string will be 5 or 6 characters, corresponding to two numbers that are each 2 or 3 characters
total if you add the two numbers will be roughly 150 to 250
maximum of either side will be 150
individual values should be maximized together, e.g. id = 8 would be 101 and 48 NOT 10 and 148, and if there's a tie, pick the pair with the largest small value, e.g. 11121 should be 111 and 21 NOT 11 and 121

Here's a sample:

tribble(
  ~id, ~to_split,
  1,   33118,
  2,   37118,
  3,   30121,
  4,   41110,
  5,   98121,
  6,   101102,
  7,   101110,
  8,   10148,
  9,   11121
) %>% 
  mutate_at("to_split", as.character)

And here's the result I'd like to get out:

tribble(
  ~id, ~left, ~right,
  1,   33, 118,
  2,   37, 118,
  3,   30, 121,
  4,   41, 110,
  5,   98, 121,
  6,   101, 102,
  7,   101, 110,
  8,   101, 48,
  9,   111, 21
)

I've thought about just splitting them into two different pairings by taking the first two characters and the remainder and the first three and the remainder and then adding those two pairings to see which total is the smallest. While I think this heuristic would work for everything, it's kind of a pain to implement. I'm thinking here of making two separate calls (with keep = TRUE), then presumably pivot_long and group_by id in order to filter the right option. Are there simpler options?

Ideally, I'd like to do this all in a mutate (possibly requiring purrr::map?) but I'm not sure how to split a string by position AND get both parts of it (without separate). If that is possible, it should be a simple matter of a case_when/ifelse to determine which pairing is smaller and then adding in a comma or other separator between them at the right position so I can bind_rows back to the rest of the table where I'll separate everything. But the only way I've found to separate a string and get both parts is to separate. Will something from stringr or stringi do this?

Jon Spring · Accepted Answer

Here's a verbose approach that might be helpful as you refine for edge cases.

library(tidyverse)

# combine splits @ 2 and @ 3
bind_rows(df1 %>% mutate(split_pos = 2),
          df1 %>% mutate(split_pos = 3)) %>%
  
# calc features
  mutate(num1 = str_sub(to_split,   end = split_pos)     %>% parse_number(),
         num2 = str_sub(to_split, start = split_pos + 1) %>% parse_number(),
         total = num1 + num2,
         max = pmax(num1, num2)) %>%
  
# filter and pick best fit
  filter(total %>% between(100, 250), max <= 150) %>%
  arrange(id) %>%
  group_by(id) %>%
  slice_min(max) %>%
  ungroup()



    id to_split split_pos  num1  num2 total   max
               
1     1 33118            2    33   118   151   118
2     2 37118            2    37   118   155   118
3     3 30121            2    30   121   151   121
4     4 41110            2    41   110   151   110
5     5 98121            2    98   121   219   121
6     6 101102           3   101   102   203   102
7     7 101110           3   101   110   211   110
8     8 10148            3   101    48   149   101
9     9 11121            3   111    21   132   111

Split string by position and numeric value in dplyr

Answers (2)

Related Questions