Reputation: 1057
I'm attempting read in another group's data and there's one column where a pair of two numbers is usually separated by a comma, but in ~10% of cases (>15,000) there is no comma. It's too many to separate by hand, but I'm having trouble devising an efficient way to separate the string efficiently. Here are the rules:
Here's a sample:
tribble(
~id, ~to_split,
1, 33118,
2, 37118,
3, 30121,
4, 41110,
5, 98121,
6, 101102,
7, 101110,
8, 10148,
9, 11121
) %>%
mutate_at("to_split", as.character)
And here's the result I'd like to get out:
tribble(
~id, ~left, ~right,
1, 33, 118,
2, 37, 118,
3, 30, 121,
4, 41, 110,
5, 98, 121,
6, 101, 102,
7, 101, 110,
8, 101, 48,
9, 111, 21
)
I've thought about just splitting them into two different pairings by taking the first two characters and the remainder and the first three and the remainder and then adding those two pairings to see which total is the smallest. While I think this heuristic would work for everything, it's kind of a pain to implement. I'm thinking here of making two separate
calls (with keep = TRUE), then presumably pivot_long
and group_by
id in order to filter
the right option. Are there simpler options?
Ideally, I'd like to do this all in a mutate (possibly requiring purrr::map
?) but I'm not sure how to split a string by position AND get both parts of it (without separate
). If that is possible, it should be a simple matter of a case_when
/ifelse
to determine which pairing is smaller and then adding in a comma or other separator between them at the right position so I can bind_rows
back to the rest of the table where I'll separate
everything. But the only way I've found to separate a string and get both parts is to separate
. Will something from stringr
or stringi
do this?
Upvotes: 0
Views: 1488
Reputation: 66490
Here's a verbose approach that might be helpful as you refine for edge cases.
library(tidyverse)
# combine splits @ 2 and @ 3
bind_rows(df1 %>% mutate(split_pos = 2),
df1 %>% mutate(split_pos = 3)) %>%
# calc features
mutate(num1 = str_sub(to_split, end = split_pos) %>% parse_number(),
num2 = str_sub(to_split, start = split_pos + 1) %>% parse_number(),
total = num1 + num2,
max = pmax(num1, num2)) %>%
# filter and pick best fit
filter(total %>% between(100, 250), max <= 150) %>%
arrange(id) %>%
group_by(id) %>%
slice_min(max) %>%
ungroup()
id to_split split_pos num1 num2 total max
<dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 33118 2 33 118 151 118
2 2 37118 2 37 118 155 118
3 3 30121 2 30 121 151 121
4 4 41110 2 41 110 151 110
5 5 98121 2 98 121 219 121
6 6 101102 3 101 102 203 102
7 7 101110 3 101 110 211 110
8 8 10148 3 101 48 149 101
9 9 11121 3 111 21 132 111
Upvotes: 1
Reputation: 41220
This works on example dataset:
library(dplyr)
df %>% mutate(cut = if_else(as.numeric(substr(to_split,1,3))<=150,3L,2L),
length = nchar(to_split)) %>%
mutate(left = substr(to_split,1,cut),
right = substr(to_split,cut+1,length)) %>%
select(-cut,-length)
# A tibble: 8 x 4
id to_split left right
<dbl> <chr> <chr> <chr>
1 1 33118 33 118
2 2 37118 37 118
3 3 30121 30 121
4 4 41110 41 110
5 5 98121 98 121
6 6 101102 101 102
7 7 101110 101 110
8 8 10148 101 48
Upvotes: 1