Reputation: 45
My data frame is this:
data.frame(stringsAsFactors=FALSE,
A = c("1234", "abc.", "e-2.1ad"),
B = c("5-4", "1-0", "a,d")
)
I want to separate the columns into multiple columns containing individual characters.
The other answers I found all involved a regular expression, pattern, or separator, which, as you can see, I can't use here, or convoluted solutions using sapply (which relied on position, but didn't work for me).
I'm sure there's a more elegant solution out there, and I would really love one using tidyr if possible, but whatever does it cleanly is much appreciated.
This is what it should look like, after all is said and done:
newdf <- data.frame(stringsAsFactors=FALSE,
A1 = c("1", "a", "e"),
A2 = c("2", "b", "-"),
A3 = c("3", "c", "2"),
A4 = c("4", ".", "."),
A5 = c(NA, NA, "1"),
A6 = c(NA, NA, "a"),
A7 = c(NA, NA, "d"),
B1 = c("5", "1", "a"),
B2 = c("-", "-", ","),
B3 = c("4", "0", "d")
)
And if the answer takes more than throwing a function or two at it, I would really appreciate it if you could explain how you go about it, rather than just giving the solution itself. Thank you!
Later edit: I was almost able to do it using the qdap package, but I couldn't get around it filling what should have been NAs (because of the strings' unequal lengths) with characters from the beginning of the string. Very odd behavior that wasn't explained in the documentation; otherwise a very promising function.
Another strange behavior that I noticed in my lame attempts to solve this was automatically transforming from characters into factors. However, I wasn't able to pinpoint where it happens along the way.
Upvotes: 2
Views: 258
Reputation: 30549
There are a number of potential options, depending on details of what you are interested in. See @Elin's comment above regarding missing 32 in 5-432.
One possibility to consider is str_split_fixed from the stringr package:
library(stringr)
str_split_fixed("1234", "", 7)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] "1"  "2"  "3"  "4"  ""   ""   ""
An empty pattern "" splits the string into individual characters. Here the call asks for 7 pieces and returns them as a character matrix, so the last 3 are empty strings. Right now, when no character is available, it returns an empty string, not NA (see github issue).
If the number of columns was based on the maximum number of characters possible for columns A and B (7 and 5 for example), one could do the following:
df_split <- as.data.frame(lapply(df, function(x) str_split_fixed(x, "", n = max(nchar(x)))), stringsAsFactors = FALSE)
df_split
  A.1 A.2 A.3 A.4 A.5 A.6 A.7 B.1 B.2 B.3 B.4 B.5
1   1   2   3   4               5   -   4   3   2
2   a   b   c   .               1   -   0
3   e   -   2   .   1   a   d   a   ,   d
Note: To replace the empty strings afterwards with NA:
df_split[df_split == ""] <- NA
df_split
  A.1 A.2 A.3 A.4  A.5  A.6  A.7 B.1 B.2 B.3  B.4  B.5
1   1   2   3   4 <NA> <NA> <NA>   5   -   4    3    2
2   a   b   c   . <NA> <NA> <NA>   1   -   0 <NA> <NA>
3   e   -   2   .    1    a    d   a   ,   d <NA> <NA>
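If you would rather get NA directly instead of patching empty strings afterwards, here is a minimal base R sketch of the same idea. It relies on the fact that indexing a vector past its end returns NA. The helper name split_chars is made up for illustration and is not part of stringr or any other package, and the sketch assumes a non-empty character column.

```r
# Hypothetical helper (not from any package): split a character vector into
# a matrix of single characters, one row per string, padded with NA.
split_chars <- function(x) {
  pieces <- strsplit(x, "")          # list of single-character vectors
  width  <- max(lengths(pieces))     # the longest string sets the column count
  # Indexing past the end of a vector yields NA, which provides the padding
  do.call(rbind, lapply(pieces, function(p) p[seq_len(width)]))
}

df <- data.frame(stringsAsFactors = FALSE,
                 A = c("1234", "abc.", "e-2.1ad"),
                 B = c("5-432", "1-0", "a,d"))

# Each column becomes a matrix; as.data.frame() names the pieces A.1, A.2, ...
out <- as.data.frame(lapply(df, split_chars), stringsAsFactors = FALSE)
out
```

Passing stringsAsFactors = FALSE explicitly keeps the pieces as characters on R versions before 4.0, where the default would convert them to factors.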
Upvotes: 2
Reputation: 389325
We can use cSplit from splitstackshape to split every character in columns A and B into separate columns:
df1 <- splitstackshape::cSplit(df, c('A', 'B'), sep = '', stripWhite = FALSE)
df1
#   A_1 A_2 A_3 A_4 A_5  A_6  A_7 B_1 B_2 B_3 B_4 B_5 B_6 B_7
#1:   1   2   3   4  NA <NA> <NA>   5   -   4   3   2  NA  NA
#2:   a   b   c   .  NA <NA> <NA>   1   -   0  NA  NA  NA  NA
#3:   e   -   2   .   1    a    d   a   ,   d  NA  NA  NA  NA
However, this gave me some additional all-NA columns for B, which can be removed using Filter:
Filter(function(x) any(!is.na(x)), df1)
#   A_1 A_2 A_3 A_4 A_5  A_6  A_7 B_1 B_2 B_3 B_4 B_5
#1:   1   2   3   4  NA <NA> <NA>   5   -   4   3   2
#2:   a   b   c   .  NA <NA> <NA>   1   -   0  NA  NA
#3:   e   -   2   .   1    a    d   a   ,   d  NA  NA
data
df <- data.frame(stringsAsFactors=FALSE,
A = c("1234", "abc.", "e-2.1ad"),
B = c("5-432", "1-0", "a,d"))
Upvotes: 1
Reputation: 6769
This is my tidyverse solution. Writing functions is new to me, so any suggestions for improvement would be appreciated.
library(tidyverse)
df <- data.frame(stringsAsFactors=FALSE,
A = c("1234", "abc.", "e-2.1ad"),
B = c("5-432", "1-0", "a,d"))
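# Split each string into a vector of single characters
a_split <- str_split(df$A, "")
b_split <- str_split(df$B, "")
# Take the num-th character from each of the three rows (NA if the string is shorter)
f1 <- function(num, s) c(s[[1]][num], s[[2]][num], s[[3]][num])
# One vector per character position: 7 positions for A, 5 for B
all_a <- lapply(1:7, f1, a_split)
all_b <- lapply(1:5, f1, b_split)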
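One possible way to finish this off, as a rough sketch rather than a definitive version: generalise f1 so it does not hard-code three rows, work out the number of positions from the data, and bind the per-position vectors into a data frame with names like A1, A2, and so on. The names nth_char and split_column are made up for illustration, and base R's strsplit stands in for str_split here only so the block runs without loading any packages.

```r
df <- data.frame(stringsAsFactors = FALSE,
                 A = c("1234", "abc.", "e-2.1ad"),
                 B = c("5-432", "1-0", "a,d"))

# Generalised f1: take the num-th character of every row, however many rows
# there are. Indexing past the end of a short string yields NA.
nth_char <- function(num, s) vapply(s, function(p) p[num], character(1))

# Hypothetical wrapper: split one column and name the pieces prefix1, prefix2, ...
split_column <- function(x, prefix) {
  s <- strsplit(x, "")                      # stands in for str_split(x, "")
  positions <- seq_len(max(lengths(s)))     # 1..7 for A, 1..5 for B
  cols <- lapply(positions, nth_char, s)    # one vector per character position
  names(cols) <- paste0(prefix, positions)
  as.data.frame(cols, stringsAsFactors = FALSE)
}

result <- cbind(split_column(df$A, "A"), split_column(df$B, "B"))
result
```

Deriving the positions from max(lengths(s)) avoids typing 1:7 and 1:5 by hand, so the same code should keep working if the strings get longer.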
Upvotes: 1