Reputation: 13
I have a dataset where there may be spaces, additional commas, and other junk amongst some numbers (see 'trial' below). Each number is separated by minimum one comma. I need to keep the numbers in the same row while padding each number with zeroes to give a total length of six (see 'desiredTrial1'). I then need to paste (or str_replace perhaps) the 2-character grower code in front of each number (see 'desiredTrial2'). A bonus would be to have the output in numerical order (though not essential)
Appreciate your help.
trial <- data.frame(
grower = c("me", "mE", "aD"),
number = c("me189, Me193, mej28", "123,,,ME234", "aD234, ad573, AD 92113, ,,Ad33")
)
desiredTrial2 <- data.frame(
grower = c("ME", "ME", "AD"),
number = c("ME000028,ME000189,ME000193", "ME000123,ME000234", "AD000033,AD000234,AD000573,AD092113")
)
I have tried the following:
trial1 <- trial %>%
mutate(
grower = as.character(case_when(
str_detect(grower, fixed("me", ignore_case = TRUE)) ~ "ME",
str_detect(grower, fixed("ad", ignore_case = TRUE)) ~ "AD",
)
)
) %>%
mutate(
number = str_replace_all(number, pattern = " *", replacement = ",")# Get rid of any spaces and convert to a comma
) %>%
mutate(
number = str_replace_all(number, pattern = "[^\\d,]+|^,*|,*$", replacement = "")# We get rid of anything that isn't a number or a digit; we also get rid of any leading commas (^,*) and any trailing commas (,*$)
) %>%
mutate(
number = str_replace_all(number, pattern = "(?<=,),", replacement = "")# Where there are more than one comma we get rid of them
)
From here I've tried str_pad but can't get it to work between the commas (as it treats the entire thing like a string). I can't simply swap the comma for "000" either as the numbers have different lengths. I've also tried separate but it gets unwieldly given the different number of elements in each vector as it separates (and not knowing how many columns are required).
I was hoping to get to this point (desiredTrial1) at which point I think I should be able to get to desiredTrial2 (my desired end state):
desiredTrial1 <- data.frame(
grower = c("ME", "ME", "AD"),
number = c("000028,000189,000193", "000123,000234", "000033,000234,000573,092113")
)
Appreciate your help.
Upvotes: 1
Views: 88
Reputation: 125418
Using tidyr::separate_rows
your could first split your string in multiple rows. Afterwards you could process each part of the split separately:
library(tidyr)
library(dplyr, warn = FALSE)
library(stringr)
trial |>
mutate(row = row_number()) |>
tidyr::separate_rows(number, sep = ",") |>
mutate(number = readr::parse_number(number)) |>
filter(!is.na(number)) |>
mutate(number = str_pad(number, 6, pad = "0"),
number = paste0(toupper(grower), number)) |>
arrange(grower, number) |>
group_by(row, grower = toupper(grower)) |>
summarise(number = paste(number, collapse = ","), .groups = "drop") |>
select(-row)
#> # A tibble: 3 × 2
#> grower number
#> <chr> <chr>
#> 1 ME ME000028,ME000189,ME000193
#> 2 ME ME000123,ME000234
#> 3 AD AD000033,AD000234,AD000573,AD092113
Upvotes: 1