GenesRus
GenesRus

Reputation: 1057

Split string by position and numeric value in dplyr

I'm attempting read in another group's data and there's one column where a pair of two numbers is usually separated by a comma, but in ~10% of cases (>15,000) there is no comma. It's too many to separate by hand, but I'm having trouble devising an efficient way to separate the string efficiently. Here are the rules:

  1. string will be 5 or 6 characters, corresponding to two numbers that are each 2 or 3 characters
  2. total if you add the two numbers will be roughly 150 to 250
  3. maximum of either side will be 150
  4. individual values should be maximized together, e.g. id = 8 would be 101 and 48 NOT 10 and 148, and if there's a tie, pick the pair with the largest small value, e.g. 11121 should be 111 and 21 NOT 11 and 121

Here's a sample:

tribble(
  ~id, ~to_split,
  1,   33118,
  2,   37118,
  3,   30121,
  4,   41110,
  5,   98121,
  6,   101102,
  7,   101110,
  8,   10148,
  9,   11121
) %>% 
  mutate_at("to_split", as.character)

And here's the result I'd like to get out:

tribble(
  ~id, ~left, ~right,
  1,   33, 118,
  2,   37, 118,
  3,   30, 121,
  4,   41, 110,
  5,   98, 121,
  6,   101, 102,
  7,   101, 110,
  8,   101, 48,
  9,   111, 21
)

I've thought about just splitting them into two different pairings by taking the first two characters and the remainder and the first three and the remainder and then adding those two pairings to see which total is the smallest. While I think this heuristic would work for everything, it's kind of a pain to implement. I'm thinking here of making two separate calls (with keep = TRUE), then presumably pivot_long and group_by id in order to filter the right option. Are there simpler options?

Ideally, I'd like to do this all in a mutate (possibly requiring purrr::map?) but I'm not sure how to split a string by position AND get both parts of it (without separate). If that is possible, it should be a simple matter of a case_when/ifelse to determine which pairing is smaller and then adding in a comma or other separator between them at the right position so I can bind_rows back to the rest of the table where I'll separate everything. But the only way I've found to separate a string and get both parts is to separate. Will something from stringr or stringi do this?

Upvotes: 0

Views: 1488

Answers (2)

Jon Spring
Jon Spring

Reputation: 66490

Here's a verbose approach that might be helpful as you refine for edge cases.

library(tidyverse)

# combine splits @ 2 and @ 3
bind_rows(df1 %>% mutate(split_pos = 2),
          df1 %>% mutate(split_pos = 3)) %>%
  
# calc features
  mutate(num1 = str_sub(to_split,   end = split_pos)     %>% parse_number(),
         num2 = str_sub(to_split, start = split_pos + 1) %>% parse_number(),
         total = num1 + num2,
         max = pmax(num1, num2)) %>%
  
# filter and pick best fit
  filter(total %>% between(100, 250), max <= 150) %>%
  arrange(id) %>%
  group_by(id) %>%
  slice_min(max) %>%
  ungroup()



    id to_split split_pos  num1  num2 total   max
  <dbl> <chr>        <dbl> <dbl> <dbl> <dbl> <dbl>
1     1 33118            2    33   118   151   118
2     2 37118            2    37   118   155   118
3     3 30121            2    30   121   151   121
4     4 41110            2    41   110   151   110
5     5 98121            2    98   121   219   121
6     6 101102           3   101   102   203   102
7     7 101110           3   101   110   211   110
8     8 10148            3   101    48   149   101
9     9 11121            3   111    21   132   111

Upvotes: 1

Waldi
Waldi

Reputation: 41220

This works on example dataset:

library(dplyr)

df %>% mutate(cut = if_else(as.numeric(substr(to_split,1,3))<=150,3L,2L),
              length = nchar(to_split)) %>%
       mutate(left  = substr(to_split,1,cut),
              right = substr(to_split,cut+1,length)) %>%
       select(-cut,-length)

# A tibble: 8 x 4
     id to_split left  right
  <dbl> <chr>    <chr> <chr>
1     1 33118    33    118  
2     2 37118    37    118  
3     3 30121    30    121  
4     4 41110    41    110  
5     5 98121    98    121  
6     6 101102   101   102  
7     7 101110   101   110  
8     8 10148    101   48  

Upvotes: 1

Related Questions