Reputation: 85
I have a messy table which has a single column that contains multiple category labels, separated by several delimiters. I want to us R to split that column at each delimiter, and create a new column for each category label. The methods I have seen can only split at one delimiter at a time.
My current table looks like this:
my_table = read.csv("./my_table.csv")
# > my_table
# ID TYPE TEXT
# 1 1 a blue water
# 2 2 a,b,c fresh water
# 3 3 a;b,f cold stream
# 4 4 f, b and c lovely sunset
# 5 5 b;c up there
I want a table that looks like this:
# ID A B C D TEXT
# 1 1 a blue water
# 2 2 a b c fresh water
# 3 3 a b d cold stream
# 4 4 b c d lovely sunset
# 5 5 b c up there
Here is what I have tried:
my_table1 <- my_table %>%
separate(TYPE, c('A', 'B'), ",")
my_table1
# > docs1
# ID A B TEXT
# 1 1 a <NA> blue water
# 2 2 a b fresh water
# 3 3 a;b f cold stream
# 4 4 f b and c lovely sunset
# 5 5 b;c <NA> up there
my_table2 <- my_table1 %>%
separate(A, c('A', 'C' ), ";")
# > docs2
# ID A C B TEXT
# 1 1 a <NA> <NA> blue water
# 2 2 a <NA> b fresh water
# 3 3 a b f cold stream
# 4 4 f <NA> b and c lovely sunset
# 5 5 b c <NA> up there
my_table3 <- my_table2 %>%
separate(A, c('A', 'D'), "and")
# > docs3
# ID A D C B TEXT
# 1 1 a <NA> <NA> <NA> blue water
# 2 2 a <NA> <NA> b fresh water
# 3 3 a <NA> b f cold stream
# 4 4 f <NA> <NA> b and c lovely sunset
# 5 5 b <NA> c <NA> up there
This gets me close, but the column names are off. Plus, I don't want to have to guess about where the string "b and c" ends up after a couple iterations. I have thousands of rows and maybe five or six categories. My guess is that there is a simpler way to do this.
Upvotes: 0
Views: 387
Reputation: 20095
The cSplit
function from splitstackshape
package can make problem easier to solve. An approach could be as:
library(splitstackshape)
# First use `gsub` to replace other delimiter and have only ',' delimiter.
my_table$TYPE <- gsub("and|;",",",my_table$TYPE)
Mod_df <- cSplit(my_table, "TYPE", sep = ",")
Mod_df
# ID TEXT TYPE_1 TYPE_2 TYPE_3
# 1: 1 blue water a NA NA
# 2: 2 fresh water a b c
# 3: 3 cold stream a b f
# 4: 4 lovely sunset f b c
# 5: 5 up there b c NA
The tidyr::gather
and spread
can be used to get the format mentioned by OP as:
library(tidyr)
gather(Mod_df, key, value, -ID,-TEXT) %>% mutate_if(is.factor, as.character) %>%
mutate(K = toupper(value)) %>%
select(-key) %>%
filter(!is.na(K)) %>%
spread(K, value)
# ID TEXT A B C F
# 1 1 blue water a <NA> <NA> <NA>
# 2 2 fresh water a b c <NA>
# 3 3 cold stream a b <NA> f
# 4 4 lovely sunset <NA> b c f
# 5 5 up there <NA> b c <NA>
Data
my_table <- read.table(text =
" ID TYPE TEXT
1 1 a 'blue water'
2 2 'a,b,c' 'fresh water'
3 3 'a;b,f' 'cold stream'
4 4 'f, b and c' 'lovely sunset'
5 5 'b;c' 'up there'",
header = TRUE, stringsAsFactors = FALSE)
Upvotes: 1
Reputation: 50678
As an alternative and to extend your tidyverse
attempt, here is a solution using strsplit
and unnest
:
df %>%
mutate(
val = strsplit(as.character(TYPE), "(;|,\\s*|\\s*and\\s*)")) %>%
unnest() %>%
select(-TYPE) %>%
group_by(ID, TEXT) %>%
mutate(n = 1:n()) %>%
spread(n, val)
## A tibble: 5 x 5
## Groups: ID, TEXT [5]
# ID TEXT `1` `2` `3`
# <int> <fct> <chr> <chr> <chr>
#1 1 blue water a NA NA
#2 2 fresh water a b c
#3 3 cold stream a b f
#4 4 lovely sunset f b c
#5 5 up there b c NA
Note that this is not exactly the same as your expected output. It does however match @MKR's output.
df <- read.table(text =
"ID TYPE TEXT
1 1 'a' 'blue water'
2 2 'a,b,c' 'fresh water'
3 3 'a;b,f' 'cold stream'
4 4 'f, b and c' 'lovely sunset'
5 5 'b;c' 'up there'")
Upvotes: 2