Reputation: 412
I want to split pairs of values, which are separated by a comma within each column, into two adjacent columns within a new data frame and with the same column name for each of the two new columns.
That is, I want to convert this:
A B C D E
1,1 0,1 1,1 1,1 1,1
1,1 1,1 1,1 1,1 1,1
0,1 0,1 0,1 0,1 0,1
to this:
A A B B C C D D E E
1 1 0 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1
0 1 0 1 0 1 0 1 0 1
If the data frame names can't have exact duplicates, A_1 and A_2... and so on should be ok. Or, having the names in the first row of the dataframe instead of as a header would be ok also.
My actual dataset is ~200 columns by ~13,000 rows, so I need an automated method for splitting columns and assigning names to the second version of the data frame.
Upvotes: 1
Views: 285
Reputation: 193517
Not to kick myself here, but, with recent developments in the "stringi" package, cSplit
doesn't take the cake for this type of work anymore.
Here's an approach using Stringi
(though it is specific to this problem).
First, create a splitting function:
library(stringi)
Stringi <- function(indf) {
X <- do.call(cbind, lapply(indf, stri_split_fixed, ",", simplify = TRUE))
`dimnames<-`(`dim<-`(as.numeric(X), dim(X)),
list(NULL, paste0(rep(names(indf), each = 2), 1:2)))
}
Second, apply it:
Stringi(df)
# A1 A2 B1 B2 C1 C2 D1 D2 E1 E2
# [1,] 1 1 0 1 1 1 1 1 1 1
# [2,] 1 1 1 1 1 1 1 1 1 1
# [3,] 0 1 0 1 0 1 0 1 0 1
On 13000 rows by 200 columns, cSplit
takes just under 11 seconds for me, and Stringi
takes less than 2.5 seconds.
Obviously, this is something already in consideration for a future version of "splitstackshape" as this Gist of mine from last month shows :-) With that version of cSplit
, the time comes down from from 11 seconds to about 2.7 seconds.
Upvotes: 4
Reputation: 99331
You could use
library(splitstackshape)
(newdf <- cSplit(df, names(df), ","))
# A_1 A_2 B_1 B_2 C_1 C_2 D_1 D_2 E_1 E_2
# 1: 1 1 0 1 1 1 1 1 1 1
# 2: 1 1 1 1 1 1 1 1 1 1
# 3: 0 1 0 1 0 1 0 1 0 1
To create duplicate column names, you can then do the following since data.table
is also loaded with splitstackshape
setnames(newdf, names(newdf), sub("_.*", "", names(newdf)))
newdf
# A A B B C C D D E E
# 1: 1 1 0 1 1 1 1 1 1 1
# 2: 1 1 1 1 1 1 1 1 1 1
# 3: 0 1 0 1 0 1 0 1 0 1
But just so you know, having duplicate column names is a terrible idea.
Upvotes: 7