Reputation: 359
Problem
I want to selectively split some strings in a column. My limited ability with regular expressions is a problem, but your thoughts would be welcome.
Reprex
My starting dataset:
media <- c("TV, Radio and Games", "Theatre, Festival", "Festival and Arts", "TV", "Theatre", "Radio")
type <- c("Indoors", "Outdoors", "Outdoors", "Indoors", "Outdoors", "Indoors")
df <- as.data.frame(cbind(media, type))
df
# media type
# 1 TV, Radio and Games Indoors
# 2 Theatre, Festival Outdoors
# 3 Festival and Arts Outdoors
# 4 TV Indoors
# 5 Theatre Outdoors
# 6 Radio Indoors
Output Wanted
media2 <- c("TV", "Radio", "Games", "Theatre", "Festival", "Festival and Arts", "TV", "Theatre", "Radio")
type2 <- c("Indoors", "Indoors", "Indoors", "Outdoors", "Outdoors", "Outdoors", "Indoors", "Outdoors", "Indoors")
df2 <- cbind(media2, type2)
df2
# media2 type2
# [1,] "TV" "Indoors"
# [2,] "Radio" "Indoors"
# [3,] "Games" "Indoors"
# [4,] "Theatre" "Outdoors"
# [5,] "Festival" "Outdoors"
# [6,] "Festival and Arts" "Outdoors"
# [7,] "TV" "Indoors"
# [8,] "Theatre" "Outdoors"
# [9,] "Radio" "Indoors"
Attempted Solution
df3 <- df %>%
tidyr::separate(media, c("media1", "media2"), ", ") %>%
tidyr::pivot_longer(1:2, names_to = "media_type", values_to = "media") %>%
select(media, type) %>%
filter(!is.na(media))
# media type
# 1 TV Indoors
# 2 Radio and Games Indoors
# 3 Theatre Outdoors
# 4 Festival Outdoors
# 5 Festival and Arts Outdoors
# 6 TV Indoors
# 7 Theatre Outdoors
# 8 Radio Indoors
This gets me some of the way there, but I also want "Radio and Games" split up, but not "Festival and Arts". I cannot split it using " and ", as that would split them both.
Upvotes: 3
Views: 76
Reputation: 39737
An option will be to look behind if Arts follows and using (?! Arts)
.
x <- strsplit(df$media, "\\s*(,|and(?! Arts))\\s*", perl=TRUE)
data.frame(media=unlist(x), type=rep(df$type, lengths(x)))
# media type
#1 TV Indoors
#2 Radio Indoors
#3 Games Indoors
#4 Theatre Outdoors
#5 Festival Outdoors
#6 Festival and Arts Outdoors
#7 TV Indoors
#8 Theatre Outdoors
#9 Radio Indoors
Or splitting only those which have at least one ,
.
i <- grepl(",", df$media)
x <- ifelse(i, str_split(df$media, pattern = "\\s*(,|and)\\s*"), df$media)
data.frame(media=unlist(x), type=rep(df$type, lengths(x)))
Upvotes: 1
Reputation: 16998
You could use a tidyverse
approach:
library(dplyr)
library(tidyr)
library(stringr)
df %>%
mutate(media = ifelse(str_detect(media, ","),
str_split(media, pattern = "\\s*(,|and)\\s*"),
media)) %>%
unnest(media)
This returns
# A tibble: 9 x 2
media type
<chr> <chr>
1 TV Indoors
2 Radio Indoors
3 Games Indoors
4 Theatre Outdoors
5 Festival Outdoors
6 Festival and Arts Outdoors
7 TV Indoors
8 Theatre Outdoors
9 Radio Indoors
Upvotes: 3
Reputation: 79286
Here is tidyverse workaround:
The problem is with Festival and Arts
because it contents and, otherwise it would be a simple seperate_rows
So:
Festival and Arts
in preserve <- df[3,]
filter
out rows with Festival and Arts
id
to perform separate_rows
and
preserve
library(dplyr)
library(tidyr)
preserve <- df[3,]
df %>%
filter(!grepl('Festival and Arts', media)) %>%
mutate(id = row_number()) %>%
separate_rows(media, id, sep = "[^[:alnum:].]+", convert = TRUE) %>%
filter(!grepl('and', media)) %>%
bind_rows(preserve) %>%
select(-id)
output:
media type
<chr> <chr>
1 TV Indoors
2 Radio Indoors
3 Games Indoors
4 Theatre Outdoors
5 Festival Outdoors
6 TV Indoors
7 Theatre Outdoors
8 Radio Indoors
9 Festival and Arts Outdoors
>
Upvotes: 1