Reputation: 9
I'm working with a very messy data set that has a column that needs to be split into several more columns based on a standard delimiter ",|"
.
This is what entries in said column look like:
Color:Red,|Texture:Rough,|Shape:Circular,|ID:1323,|Location:Canada,|Video-Status:Yes
The main problem I'm having is that not all descriptors that need to be split appear in the same order. Sometimes color is first, other times it appears last. Additionally, Some metrics do not appear, for example, "Video-Status" isn't in every row.
What would be the best way to go about creating 6 new columns from the data I've provided? Scratching my head here...
Upvotes: 1
Views: 198
Reputation: 50668
I would do this using various tidyr
functions. I created some sample data with entries being swapped and missing.
library(tidyverse)
df %>%
rowid_to_column("row") %>%
separate_rows(V1, sep = "\\|") %>%
mutate(V1 = str_replace(V1, ",$", "")) %>%
separate(V1, c("key", "value"), sep = ":") %>%
spread(key, value, fill = NA)
# row Color ID Location Shape Texture Video-Status
#1 1 Red 1323 Canada Circular Rough Yes
#2 2 Red 1323 Canada Circular Rough Yes
#3 3 Red 1323 Canada Circular Rough <NA>
Explanation: We first separate entries into different rows by splitting entries at "|"
, remove trailing ","
, separate entries into different columns by splitting entries at ":"
and finally reshape from long to wide to produce your expected output.
df <- read.table(text =
"Color:Red,|Texture:Rough,|Shape:Circular,|ID:1323,|Location:Canada,|Video-Status:Yes
Texture:Rough,|Color:Red,|Shape:Circular,|ID:1323,|Location:Canada,|Video-Status:Yes
Texture:Rough,|Color:Red,|Shape:Circular,|ID:1323,|Location:Canada")
Upvotes: 1
Reputation: 93813
There is an obscure R function, read.dcf
that can deal with Name:Value
pair data. Here's an example with multiple rows, with the order and the completeness of each pair varying:
x <- "Color:Red,|Texture:Rough,|Shape:Circular,|ID:1323,|Location:Canada,|Video-Status:Yes"
x2 <- "Texture:Rough,|Color:Red,|Shape:Circular,|ID:1323,|Location:Canada"
dat <- data.frame(col = c(x,x2), stringsAsFactors=FALSE)
dat
# col
#1 Color:Red,|Texture:Rough,|Shape:Circular,|ID:1323,|Location:Canada,|Video-Status:Yes
#2 Texture:Rough,|Color:Red,|Shape:Circular,|ID:1323,|Location:Canada
Then process after collapsing to one long piece of text with line breaks:
read.dcf(textConnection(paste(gsub(",[|]", "\n", dat$col), collapse="\n\n")))
# Color Texture Shape ID Location Video-Status
#[1,] "Red" "Rough" "Circular" "1323" "Canada" "Yes"
#[2,] "Red" "Rough" "Circular" "1323" "Canada" NA
Upvotes: 2