LCM
LCM

Reputation: 412

Splitting a column into two new columns with same column name

I want to split pairs of values, which are separated by a comma within each column, into two adjacent columns within a new data frame and with the same column name for each of the two new columns.

That is, I want to convert this:

A   B   C   D   E 
1,1 0,1 1,1 1,1 1,1 
1,1 1,1 1,1 1,1 1,1
0,1 0,1 0,1 0,1 0,1 

to this:

A  A  B  B  C  C  D  D  E  E
1  1  0  1  1  1  1  1  1  1
1  1  1  1  1  1  1  1  1  1
0  1  0  1  0  1  0  1  0  1

If the data frame names can't have exact duplicates, A_1 and A_2... and so on should be ok. Or, having the names in the first row of the dataframe instead of as a header would be ok also.

My actual dataset is ~200 columns by ~13,000 rows, so I need an automated method for splitting columns and assigning names to the second version of the data frame.

Upvotes: 1

Views: 285

Answers (2)

A5C1D2H2I1M1N2O1R2T1
A5C1D2H2I1M1N2O1R2T1

Reputation: 193517

Not to kick myself here, but, with recent developments in the "stringi" package, cSplit doesn't take the cake for this type of work anymore.

Here's an approach using Stringi (though it is specific to this problem).

First, create a splitting function:

library(stringi)
Stringi <- function(indf) {
  X <- do.call(cbind, lapply(indf, stri_split_fixed, ",", simplify = TRUE))
  `dimnames<-`(`dim<-`(as.numeric(X), dim(X)), 
               list(NULL, paste0(rep(names(indf), each = 2), 1:2)))
}

Second, apply it:

Stringi(df)
#      A1 A2 B1 B2 C1 C2 D1 D2 E1 E2
# [1,]  1  1  0  1  1  1  1  1  1  1
# [2,]  1  1  1  1  1  1  1  1  1  1
# [3,]  0  1  0  1  0  1  0  1  0  1

On 13000 rows by 200 columns, cSplit takes just under 11 seconds for me, and Stringi takes less than 2.5 seconds.


Obviously, this is something already in consideration for a future version of "splitstackshape" as this Gist of mine from last month shows :-) With that version of cSplit, the time comes down from from 11 seconds to about 2.7 seconds.

Upvotes: 4

Rich Scriven
Rich Scriven

Reputation: 99331

You could use

library(splitstackshape)
(newdf <- cSplit(df, names(df), ","))
#    A_1 A_2 B_1 B_2 C_1 C_2 D_1 D_2 E_1 E_2
# 1:   1   1   0   1   1   1   1   1   1   1
# 2:   1   1   1   1   1   1   1   1   1   1
# 3:   0   1   0   1   0   1   0   1   0   1

To create duplicate column names, you can then do the following since data.table is also loaded with splitstackshape

setnames(newdf, names(newdf), sub("_.*", "", names(newdf)))
newdf
#    A A B B C C D D E E
# 1: 1 1 0 1 1 1 1 1 1 1
# 2: 1 1 1 1 1 1 1 1 1 1
# 3: 0 1 0 1 0 1 0 1 0 1

But just so you know, having duplicate column names is a terrible idea.

Upvotes: 7

Related Questions