Reputation: 37

R - splitting a column of varied string length in a data frame into multiple columns of just one character

I have a data frame like this:

Name     S1     S2     S3     Symbol
n_12     2.3    6.1    0      A
n_13     3.4    3.7    0      ACM
n_14     1.3    1.0    0      BN
n_23     2.0    4.1    0      NOPXY

And I am looking to split the last column, Symbol, into multiple columns, each with one character or nothing.

    Name     S1     S2     S3     Sy1     Sy2     Sy3     Sy4     Sy5
    n_12     2.3    6.1    0      A                               
    n_13     3.4    3.7    0      A       C       M               
    n_14     1.3    1.0    0      B       N                       
    n_23     2.0    4.1    0      N       O       P       X       Y

Thank you for any and all help with this.

Upvotes: 3

Answers (4)

Wimpel

Reputation: 27782

For completeness' sake, here is a one-line data.table solution using tstrsplit(). The number of columns created is dynamic and is based on the maximum length of Symbol.

library(data.table)

dt <- fread("Name     S1     S2     S3     Symbol
n_12     2.3    6.1    0      A
n_13     3.4    3.7    0      ACM
n_14     1.3    1.0    0      BN
n_23     2.0    4.1    0      NOPXY")

dt[, paste0( "Sy", 1:length(tstrsplit(dt$Symbol, ""))) := tstrsplit( Symbol, "" )][]

#    Name  S1  S2 S3 Symbol Sy1  Sy2  Sy3  Sy4  Sy5
# 1: n_12 2.3 6.1  0      A   A <NA> <NA> <NA> <NA>
# 2: n_13 3.4 3.7  0    ACM   A    C    M <NA> <NA>
# 3: n_14 1.3 1.0  0     BN   B    N <NA> <NA> <NA>
# 4: n_23 2.0 4.1  0  NOPXY   N    O    P    X    Y
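
As a rough illustration of the idea the one-liner relies on, the base R sketch below mimics "split each string, then transpose the pieces into columns" without data.table. The names symbols, pieces, padded and cols are made up for this example, and this is not a claim about how data.table implements tstrsplit() internally.

```r
# Hypothetical stand-in for dt$Symbol
symbols <- c("A", "ACM", "BN", "NOPXY")

# Split every string into single characters (a list of character vectors)
pieces <- strsplit(symbols, "")

# The longest string decides how many columns are needed
n <- max(lengths(pieces))

# Pad each vector to length n; indexing past the end yields NA
padded <- lapply(pieces, function(p) p[seq_len(n)])

# "Transpose": collect the i-th character of every string into column i
cols <- lapply(seq_len(n), function(i) vapply(padded, `[`, character(1), i))
names(cols) <- paste0("Sy", seq_len(n))

cols$Sy2
# [1] NA  "C" "N" "O"
```

The padding with NA matches the <NA> cells in the output above.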

Upvotes: 0

thelatemail

Reputation: 93938

Here's a base R version using strcapture:

ns <- max(nchar(dat$Symbol))
cbind(
  dat,
  strcapture(
    paste(rep("(.)", ns), collapse=""),
    format(dat$Symbol, width=ns),
    proto=setNames(rep(list(""), ns), paste0("Sy",1:ns))
  )
)
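
To make the two generated inputs to strcapture easier to follow, here is a small sketch that builds them on their own. It assumes Symbol holds the four values from the question; the names symbols, pattern and padded are only for illustration.

```r
# Hypothetical stand-in for dat$Symbol
symbols <- c("A", "ACM", "BN", "NOPXY")
ns <- max(nchar(symbols))

# One capture group per character position
pattern <- paste(rep("(.)", ns), collapse = "")
pattern
# [1] "(.)(.)(.)(.)(.)"

# format() right-pads every string with spaces to the same width,
# so the pattern always has ns characters to match
padded <- format(symbols, width = ns)
padded
# [1] "A    " "ACM  " "BN   " "NOPXY"

# Full match followed by the captured groups for the second string
regmatches(padded[2], regexec(pattern, padded[2]))[[1]]
# [1] "ACM  " "A"     "C"     "M"     " "     " "
```

Note that the padded positions are captured as a single space rather than an empty string, so the seemingly empty cells produced this way may contain " ". If that matters, trimws() on the new columns is one way to normalise them.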

A late base R addition using substring, which is vectorized over the input strings as well as over the start and end positions of each substring:

dat[paste0("Sy",seq(ns))] <- matrix(substring(rep(dat$Symbol,each=ns),
                                    seq(ns), seq(ns)), ncol=ns, byrow=TRUE)


#  Name  S1  S2 S3 Symbol Sy1 Sy2 Sy3 Sy4 Sy5
#1 n_12 2.3 6.1  0      A   A                
#2 n_13 3.4 3.7  0    ACM   A   C   M        
#3 n_14 1.3 1.0  0     BN   B   N            
#4 n_23 2.0 4.1  0  NOPXY   N   O   P   X   Y
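
The substring approach may be easier to read when taken apart. The sketch below assumes the same four Symbol values as in the question; symbols and m are illustrative names only.

```r
# Hypothetical stand-in for dat$Symbol
symbols <- c("A", "ACM", "BN", "NOPXY")
ns <- max(nchar(symbols))

# For one string: positions past the end come back as ""
substring("ACM", seq(ns), seq(ns))
# [1] "A" "C" "M" ""  ""

# For all strings: repeat each string ns times, and let the
# start/end positions 1..ns recycle alongside them
m <- matrix(substring(rep(symbols, each = ns), seq(ns), seq(ns)),
            ncol = ns, byrow = TRUE)
m[2, ]
# [1] "A" "C" "M" ""  ""
```

Because substring returns "" beyond the end of a string, the unused cells here are true empty strings rather than NA.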

Upvotes: 3

divibisan

Reputation: 12165

One way to do this is with tidyr::separate which splits a single column containing a string into multiple columns containing substrings.

df
  Name  S1  S2 S3 Symbol
1 n_12 2.3 6.1  0      A
2 n_13 3.4 3.7  0    ACM
3 n_14 1.3 1.0  0     BN
4 n_23 2.0 4.1  0  NOPXY

The sep= argument for separate accepts either a regex, or a numeric vector listing the positions in the string to split on. Since we want to split after every character, we want to give a numeric sequence from 1 to the length of the longest string (-1, since we don't need to split after the last character). The length of the longest string is calculated with max(nchar(.$Symbol)). Thanks to Rich Scriven for pointing out that nchar is vectorized and so doesn't need to be called with sapply.

We then make a character vector with the names of the columns to split Symbol into. In your case, we can just paste 'Sy' onto the sequence from 1 to the length of the longest string to get c('Sy1', 'Sy2', ...):

library(magrittr)  # provides %>% (also attached by dplyr or the tidyverse)

df %>%
    tidyr::separate(Symbol,
                    sep = seq_len(max(nchar(.$Symbol)) - 1),
                    into = paste0('Sy', seq_len(max(nchar(.$Symbol)))))

  Name  S1  S2 S3 Sy1 Sy2 Sy3 Sy4 Sy5
1 n_12 2.3 6.1  0   A                
2 n_13 3.4 3.7  0   A   C   M        
3 n_14 1.3 1.0  0   B   N            
4 n_23 2.0 4.1  0   N   O   P   X   Y
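
The two helper expressions passed to separate can be checked on their own in base R. This is only a sketch under the assumption that Symbol holds the four values from the question; symbols, longest, split_positions and column_names are hypothetical names.

```r
# Hypothetical stand-in for df$Symbol
symbols <- c("A", "ACM", "BN", "NOPXY")

# nchar() is vectorized, so no sapply() is needed
longest <- max(nchar(symbols))

# Split after every character except the last one
split_positions <- seq_len(longest - 1)
split_positions
# [1] 1 2 3 4

# One column name per character position
column_names <- paste0("Sy", seq_len(longest))
column_names
# [1] "Sy1" "Sy2" "Sy3" "Sy4" "Sy5"
```

There is always one more column name than split position, which is why the first sequence stops at longest - 1.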

If you get the following error:

Error in nchar(.$Symbol) : 'nchar()' requires a character vector

then df$Symbol is most likely a factor rather than a character vector (converting strings to factors was the default when creating or loading a data.frame before R 4.0.0).

You can either pass stringsAsFactors = FALSE to read.table or data.frame to keep the Symbol variable from being converted to a factor, or convert it back to character.

Tidyverse option (which can be inserted into the pipe just before the call to tidyr::separate):

df <- df %>%
    dplyr::mutate(Symbol = as.character(Symbol))

or with base R:

df$Symbol <- as.character(df$Symbol)
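
A minimal sketch of the factor problem and the fix, using a standalone vector instead of the full data frame (symbol_factor, attempt and symbol_chr are names invented for this example):

```r
# A factor column, as produced by stringsAsFactors = TRUE
symbol_factor <- factor(c("A", "ACM", "BN", "NOPXY"))

# nchar() refuses factors; capture the error message instead of stopping
attempt <- tryCatch(nchar(symbol_factor),
                    error = function(e) conditionMessage(e))
attempt
# [1] "'nchar()' requires a character vector"

# Converting back to character makes nchar() work again
symbol_chr <- as.character(symbol_factor)
nchar(symbol_chr)
# [1] 1 3 2 5
```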

Upvotes: 9

Jilber Urbina

Reputation: 61214

Here's a base R approach using brute force:

string <- strsplit(df$Symbol, "")
ind <- max(lengths(string))
out <- data.frame(df, do.call(rbind, lapply(string, function(x) {
  if(length(x) !=  ind){
    c(x[1:length(x)], x[(length(x)+1):ind] )
  }else{
    x
  }
})))
names(out) <- sub("X(\\d)", "Sy\\1", names(out))
print(out, na.print = "")

  Name  S1  S2 S3 Symbol Sy1 Sy2 Sy3 Sy4 Sy5
1 n_12 2.3 6.1  0      A   A                
2 n_13 3.4 3.7  0    ACM   A   C   M        
3 n_14 1.3 1.0  0     BN   B   N            
4 n_23 2.0 4.1  0  NOPXY   N   O   P   X   Y
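
The padding in the function above relies on the fact that indexing a vector past its end returns NA. A short sketch of just that step, with x and ind as illustrative values:

```r
x <- c("B", "N")   # characters of one Symbol
ind <- 5           # length of the longest Symbol

# Asking for positions 1..ind pads the missing ones with NA
x[seq_len(ind)]
# [1] "B" "N" NA  NA  NA
```

This suggests the if/else branch could probably be collapsed to x[seq_len(ind)], since that expression returns x unchanged when its length already equals ind. The na.print = "" argument in the print call is what hides those NA values in the displayed table.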

Upvotes: 1
