Marc
Marc

Reputation: 9

Complex String Split to Columns In R

I'm working with a very messy data set that has a column that needs to be split into several more columns based on a standard delimiter ",|".

This is what entries in said column look like:

Color:Red,|Texture:Rough,|Shape:Circular,|ID:1323,|Location:Canada,|Video-Status:Yes

The main problem I'm having is that not all descriptors that need to be split appear in the same order. Sometimes color is first, other times it appears last. Additionally, Some metrics do not appear, for example, "Video-Status" isn't in every row.

What would be the best way to go about creating 6 new columns from the data I've provided? Scratching my head here...

Upvotes: 1

Views: 198

Answers (2)

Maurits Evers
Maurits Evers

Reputation: 50668

I would do this using various tidyr functions. I created some sample data with entries being swapped and missing.

library(tidyverse)
df %>%
    rowid_to_column("row") %>%
    separate_rows(V1, sep = "\\|") %>%
    mutate(V1 = str_replace(V1, ",$", "")) %>%
    separate(V1, c("key", "value"), sep = ":") %>%
    spread(key, value, fill = NA)
#  row Color   ID Location    Shape Texture Video-Status
#1   1   Red 1323   Canada Circular   Rough          Yes
#2   2   Red 1323   Canada Circular   Rough          Yes
#3   3   Red 1323   Canada Circular   Rough         <NA>

Explanation: We first separate entries into different rows by splitting entries at "|", remove trailing ",", separate entries into different columns by splitting entries at ":" and finally reshape from long to wide to produce your expected output.


Sample data

df <- read.table(text =
    "Color:Red,|Texture:Rough,|Shape:Circular,|ID:1323,|Location:Canada,|Video-Status:Yes
    Texture:Rough,|Color:Red,|Shape:Circular,|ID:1323,|Location:Canada,|Video-Status:Yes
    Texture:Rough,|Color:Red,|Shape:Circular,|ID:1323,|Location:Canada")

Upvotes: 1

thelatemail
thelatemail

Reputation: 93813

There is an obscure R function, read.dcf that can deal with Name:Value pair data. Here's an example with multiple rows, with the order and the completeness of each pair varying:

x  <- "Color:Red,|Texture:Rough,|Shape:Circular,|ID:1323,|Location:Canada,|Video-Status:Yes"
x2 <- "Texture:Rough,|Color:Red,|Shape:Circular,|ID:1323,|Location:Canada"
dat <- data.frame(col = c(x,x2), stringsAsFactors=FALSE)

dat
#                                                                                   col
#1 Color:Red,|Texture:Rough,|Shape:Circular,|ID:1323,|Location:Canada,|Video-Status:Yes
#2                   Texture:Rough,|Color:Red,|Shape:Circular,|ID:1323,|Location:Canada

Then process after collapsing to one long piece of text with line breaks:

read.dcf(textConnection(paste(gsub(",[|]", "\n", dat$col), collapse="\n\n")))
#     Color Texture Shape      ID     Location Video-Status
#[1,] "Red" "Rough" "Circular" "1323" "Canada" "Yes"       
#[2,] "Red" "Rough" "Circular" "1323" "Canada" NA  

Upvotes: 2

Related Questions