Namra
Namra

Reputation: 359

How to split R column into separate columns selectively

Problem

I want to selectively split some strings in a column. My limited ability with regular expressions is a problem, but your thoughts would be welcome.

Reprex

My starting dataset:

media <- c("TV, Radio and Games", "Theatre, Festival", "Festival and Arts", "TV", "Theatre", "Radio")
type <- c("Indoors", "Outdoors", "Outdoors", "Indoors", "Outdoors", "Indoors")

df <- as.data.frame(cbind(media, type))
df

#   media                type
# 1 TV, Radio and Games    Indoors
# 2   Theatre, Festival    Outdoors
# 3   Festival and Arts    Outdoors
# 4                  TV    Indoors
# 5             Theatre    Outdoors
# 6               Radio    Indoors

Output Wanted

media2 <- c("TV", "Radio", "Games", "Theatre", "Festival", "Festival and Arts", "TV", "Theatre", "Radio")
type2 <- c("Indoors", "Indoors", "Indoors", "Outdoors", "Outdoors", "Outdoors", "Indoors", "Outdoors", "Indoors")

df2 <- cbind(media2, type2)
df2

#      media2              type2     
# [1,] "TV"                "Indoors" 
# [2,] "Radio"             "Indoors" 
# [3,] "Games"             "Indoors" 
# [4,] "Theatre"           "Outdoors"
# [5,] "Festival"          "Outdoors"
# [6,] "Festival and Arts" "Outdoors"
# [7,] "TV"                "Indoors" 
# [8,] "Theatre"           "Outdoors"
# [9,] "Radio"             "Indoors" 

Attempted Solution

df3 <- df %>% 
  tidyr::separate(media, c("media1", "media2"), ", ") %>% 
  tidyr::pivot_longer(1:2, names_to = "media_type", values_to = "media") %>% 
  select(media, type) %>% 
  filter(!is.na(media))

#   media             type    
# 1 TV                Indoors 
# 2 Radio and Games   Indoors 
# 3 Theatre           Outdoors
# 4 Festival          Outdoors
# 5 Festival and Arts Outdoors
# 6 TV                Indoors 
# 7 Theatre           Outdoors
# 8 Radio             Indoors 

This gets me some of the way there, but I also want "Radio and Games" split up, but not "Festival and Arts". I cannot split it using " and ", as that would split them both.

Upvotes: 3

Views: 76

Answers (3)

GKi
GKi

Reputation: 39737

An option will be to look behind if Arts follows and using (?! Arts).

x <- strsplit(df$media, "\\s*(,|and(?! Arts))\\s*", perl=TRUE)
data.frame(media=unlist(x), type=rep(df$type, lengths(x)))
#              media     type
#1                TV  Indoors
#2             Radio  Indoors
#3             Games  Indoors
#4           Theatre Outdoors
#5          Festival Outdoors
#6 Festival and Arts Outdoors
#7                TV  Indoors
#8           Theatre Outdoors
#9             Radio  Indoors

Or splitting only those which have at least one ,.

i <- grepl(",", df$media)
x <- ifelse(i, str_split(df$media, pattern = "\\s*(,|and)\\s*"), df$media)
data.frame(media=unlist(x), type=rep(df$type, lengths(x)))

Upvotes: 1

Martin Gal
Martin Gal

Reputation: 16998

You could use a tidyverse approach:

library(dplyr)
library(tidyr)
library(stringr)

df %>% 
  mutate(media = ifelse(str_detect(media, ","),
                        str_split(media, pattern = "\\s*(,|and)\\s*"),
                        media)) %>% 
  unnest(media)

This returns

# A tibble: 9 x 2
  media             type    
  <chr>             <chr>   
1 TV                Indoors 
2 Radio             Indoors 
3 Games             Indoors 
4 Theatre           Outdoors
5 Festival          Outdoors
6 Festival and Arts Outdoors
7 TV                Indoors 
8 Theatre           Outdoors
9 Radio             Indoors 

Upvotes: 3

TarJae
TarJae

Reputation: 79286

Here is tidyverse workaround:

The problem is with Festival and Arts because it contents and, otherwise it would be a simple seperate_rows

So:

  1. save Festival and Arts in preserve <- df[3,]
  2. filter out rows with Festival and Arts
  3. add an id to perform separate_rows
  4. after that filter out rows with and
  5. and bind to preserve
library(dplyr)
library(tidyr)

preserve <- df[3,]

df %>% 
    filter(!grepl('Festival and Arts', media)) %>% 
    mutate(id = row_number()) %>% 
    separate_rows(media, id, sep = "[^[:alnum:].]+", convert = TRUE) %>% 
    filter(!grepl('and', media)) %>% 
    bind_rows(preserve) %>% 
    select(-id)

output:

media             type    
  <chr>             <chr>   
1 TV                Indoors 
2 Radio             Indoors 
3 Games             Indoors 
4 Theatre           Outdoors
5 Festival          Outdoors
6 TV                Indoors 
7 Theatre           Outdoors
8 Radio             Indoors 
9 Festival and Arts Outdoors
> 

Upvotes: 1

Related Questions