Reputation: 93
I have a data frame containing 3 columns. The first one has string values some of which contain commas, the other two: ID and area information.
df <- data.frame(column1 = c("ab 34, 35, 36", "cb 23", "df 45, 46", "gh 21"),
column2 = c("ID_27", "ID_28", "ID_29", "ID_30"),
column3 = c("area51", "area52", "area53", "area54"))
head(df)
column1 column2 column3
1 ab 34, 35, 36 ID_27 area51
2 cb 23 ID_28 area52
3 df 45, 46 ID_29 area53
4 gh 21 ID_30 area54
What I would like to do is modify the values in the first column, so that commas separating values disappear, apply the two letter prefix to each number value and rearrage each value separated by a comma in a new row. At the same time, duplicating the values in the other columns, like the example below:
new_df <- data_frame(column1 = c("ab34", "ab35", "ab36", "cb23", "df45", "df46", "gh21"),
column2 = c("ID_27", "ID_27", "ID_27", "ID_28", "ID_29", "ID_29", "ID_30"),
column3 = c("area51", "area51", "area51", "area52", "area53", "area53", "area54"))
head(new_df)
# A tibble: 6 x 3
column1 column2 column3
<chr> <chr> <chr>
1 ab34 ID_27 area51
2 ab35 ID_27 area51
3 ab36 ID_27 area51
4 cb23 ID_28 area52
5 df45 ID_29 area53
6 df46 ID_29 area53
Does anybody know what R code could achieve this? Using either tidyverse or older approaches? Btw, no need to go from a dataframe to a tibble, this is just for the example. My aim is to transform a dataframe.
Upvotes: 1
Views: 43
Reputation: 21400
Here's a (hopefully) easier-to understand step-by-step approach:
library(dplyr)
library(stringr)
library(tidyr)
df %>%
mutate(
# extract the prefix:
prefix = str_extract(column1,"^\\w+"),
# extract the digits in a list:
digits = str_extract_all(column1,"\\d+")) %>%
# cast the list values in `digits` into long format:
unnest_longer(digits) %>%
# paste `prefix`, `digits` together:
mutate(column1 = paste0(prefix, digits)) %>%
# remove obsolete columns:
select(-c(prefix, digits))
# A tibble: 7 × 3
column1 column2 column3
<chr> <chr> <chr>
1 ab34 ID_27 area51
2 ab35 ID_27 area51
3 ab36 ID_27 area51
4 cb23 ID_28 area52
5 df45 ID_29 area53
6 df46 ID_29 area53
7 gh21 ID_30 area54
Upvotes: 0
Reputation: 25323
A possible solution:
library(tidyverse)
df <- data.frame(column1 = c("ab 34, 35, 36", "cb 23", "df 45, 46", "gh 21"),
column2 = c("ID_27", "ID_28", "ID_29", "ID_30"),
column3 = c("area51", "area52", "area53", "area54"))
df %>%
mutate(column1 = str_replace_all(column1,", ", str_extract(column1,"^\\S+")) %>%
str_remove(.," ")) %>%
separate_rows(column1, sep = "(?<=\\d)(?=\\D)")
#> # A tibble: 7 × 3
#> column1 column2 column3
#> <chr> <chr> <chr>
#> 1 ab34 ID_27 area51
#> 2 ab35 ID_27 area51
#> 3 ab36 ID_27 area51
#> 4 cb23 ID_28 area52
#> 5 df45 ID_29 area53
#> 6 df46 ID_29 area53
#> 7 gh21 ID_30 area54
Upvotes: 1