Reputation: 195
So I have this data which is very messy and in particular one column is causing me huge problems. Basically instead of having one observation per row, some of the rows have two or more observations, usually separated by a comma.
My challenge is to separate out the different observations into different columns but the separate function in the tidyr package doesnt work because some rows have more observations than others.
So I figure I need something that tells me how many comma separated observations I am dealing with in each row. Then I can maybe move forward with separating them all out. But I have no idea what that something might be.
Any help you can give me would be much appreciated.
Thanks in advance
Upvotes: 1
Views: 148
Reputation: 31171
You can use regex if you to know the number of comma:
len = sapply(gregexpr(",",df$x), length) + 1
len
#[1] 3 2 4
And then use separate from tidyr
by specifying extra option:
library(tidyr)
separate(df, x, sep=',', into=paste0('col',1:max(len)), extra = "merge")
# col1 col2 col3 col4
#1 a1 a2 a3 <NA>
#2 b1 b2 <NA> <NA>
#3 c1 c2 c3 c4
Note than behind separate there is no more than strsplit
function!
Data:
df = data.frame(x = c("a1,a2,a3", "b1,b2", "c1,c2,c3,c4"))
Upvotes: 1
Reputation: 887118
You could try cSplit
("df" from @Colonel Beauvel's post)
library(splitstackshape)
cSplit(df, 'x', sep=',')
# x_1 x_2 x_3 x_4
#1: a1 a2 a3 NA
#2: b1 b2 NA NA
#3: c1 c2 c3 c4
Upvotes: 2
Reputation: 4042
Perhaps you can use strsplit
to split your comma-separated variable into a list of strings and then use the sapply
function to count the length of each list element:
> x <- c("a1,a2,a3", "b1,b2", "c1,c2,c3,c4")
> (tmp <- strsplit(x, ","))
[[1]]
[1] "a1" "a2" "a3"
[[2]]
[1] "b1" "b2"
[[3]]
[1] "c1" "c2" "c3" "c4"
> sapply(tmp, length)
[1] 3 2 4
Upvotes: 1