split columns with delimiters and multiple categories in R

Question

I have a messy table which has a single column that contains multiple category labels, separated by several delimiters. I want to us R to split that column at each delimiter, and create a new column for each category label. The methods I have seen can only split at one delimiter at a time.

My current table looks like this:

my_table = read.csv("./my_table.csv")
# > my_table
#   ID       TYPE          TEXT
# 1  1          a    blue water
# 2  2      a,b,c   fresh water
# 3  3      a;b,f   cold stream
# 4  4 f, b and c lovely sunset
# 5  5        b;c      up there

I want a table that looks like this:

#   ID A B C D          TEXT
# 1  1 a          blue water
# 2  2 a b c     fresh water
# 3  3 a b   d   cold stream
# 4  4   b c d lovely sunset
# 5  5   b c        up there

Here is what I have tried:

my_table1 <- my_table %>%
  separate(TYPE, c('A', 'B'), ",")
my_table1
# > docs1
#   ID   A        B          TEXT
# 1  1   a         blue water
# 2  2   a        b   fresh water
# 3  3 a;b        f   cold stream
# 4  4   f  b and c lovely sunset
# 5  5 b;c           up there

my_table2 <- my_table1 %>%
  separate(A, c('A', 'C' ), ";")
# > docs2
#   ID A    C        B          TEXT
# 1  1 a          blue water
# 2  2 a         b   fresh water
# 3  3 a    b        f   cold stream
# 4  4 f   b and c lovely sunset
# 5  5 b    c           up there

my_table3 <- my_table2 %>%
  separate(A, c('A', 'D'), "and")
# > docs3
#   ID A    D    C        B          TEXT
# 1  1 a           blue water
# 2  2 a          b   fresh water
# 3  3 a     b        f   cold stream
# 4  4 f    b and c lovely sunset
# 5  5 b     c           up there

This gets me close, but the column names are off. Plus, I don't want to have to guess about where the string "b and c" ends up after a couple iterations. I have thousands of rows and maybe five or six categories. My guess is that there is a simpler way to do this.

MKR · Accepted Answer

The cSplit function from splitstackshape package can make problem easier to solve. An approach could be as:

library(splitstackshape)

# First use `gsub` to replace other delimiter and have only ',' delimiter. 
my_table$TYPE <- gsub("and|;",",",my_table$TYPE)

Mod_df <- cSplit(my_table, "TYPE", sep = ",")

Mod_df
#    ID          TEXT TYPE_1 TYPE_2 TYPE_3
# 1:  1    blue water      a     NA     NA
# 2:  2   fresh water      a      b      c
# 3:  3   cold stream      a      b      f
# 4:  4 lovely sunset      f      b      c
# 5:  5      up there      b      c     NA

The tidyr::gather and spread can be used to get the format mentioned by OP as:

library(tidyr)

gather(Mod_df, key, value, -ID,-TEXT) %>% mutate_if(is.factor, as.character) %>%
  mutate(K = toupper(value)) %>%
  select(-key) %>%
  filter(!is.na(K)) %>%
  spread(K, value)
# ID          TEXT    A    B    C    F
# 1  1    blue water    a   
# 2  2   fresh water    a    b    c 
# 3  3   cold stream    a    b     f
# 4  4 lovely sunset     b    c    f
# 5  5      up there     b    c

Data

my_table <- read.table(text = 
"  ID       TYPE          TEXT
1  1          a    'blue water'
2  2      'a,b,c'   'fresh water'
3  3      'a;b,f'   'cold stream'
4  4 'f, b and c' 'lovely sunset'
5  5        'b;c'      'up there'",
header = TRUE, stringsAsFactors = FALSE)

split columns with delimiters and multiple categories in R

Answers (2)

Sample data

Related Questions