Reputation: 37
I have a data frame like this:
Name S1 S2 S3 Symbol
n_12 2.3 6.1 0 A
n_13 3.4 3.7 0 ACM
n_14 1.3 1.0 0 BN
n_23 2.0 4.1 0 NOPXY
And I am looking to split the last column, Symbol, into multiple columns, each with one character or nothing.
Name S1 S2 S3 Sy1 Sy2 Sy3 Sy4 Sy5
n_12 2.3 6.1 0 A
n_13 3.4 3.7 0 A C M
n_14 1.3 1.0 0 B N
n_23 2.0 4.1 0 N O P X Y
Thank you for any and all help with this.
Upvotes: 3
Views: 2596
Reputation: 27732
For completeness' sake, here is a one-line data.table solution using tstrsplit(). The number of columns to be created is dynamic and is based on the maximum length of Symbol.
library(data.table)
dt <- fread("Name S1 S2 S3 Symbol
n_12 2.3 6.1 0 A
n_13 3.4 3.7 0 ACM
n_14 1.3 1.0 0 BN
n_23 2.0 4.1 0 NOPXY")
dt[, paste0( "Sy", 1:length(tstrsplit(dt$Symbol, ""))) := tstrsplit( Symbol, "" )][]
# Name S1 S2 S3 Symbol Sy1 Sy2 Sy3 Sy4 Sy5
# 1: n_12 2.3 6.1 0 A A <NA> <NA> <NA> <NA>
# 2: n_13 3.4 3.7 0 ACM A C M <NA> <NA>
# 3: n_14 1.3 1.0 0 BN B N <NA> <NA> <NA>
# 4: n_23 2.0 4.1 0 NOPXY N O P X Y
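To illustrate why the column count is dynamic, here is a minimal sketch of what tstrsplit() returns on its own. The vector symbols is only the Symbol column from the question, copied here for illustration; it is not part of the answer above.

```r
library(data.table)

# Symbol values copied from the question (illustrative input)
symbols <- c("A", "ACM", "BN", "NOPXY")

# One list element per character position; shorter strings are padded with NA
parts <- tstrsplit(symbols, "")

length(parts)   # number of new columns needed (length of the longest string)
parts[[1]]      # first character of every string
parts[[5]]      # fifth character, NA where the string is shorter
```

Because the list has one element per position, its length can be used directly to build the Sy1, Sy2, ... column names.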
Upvotes: 0
Reputation: 93803
Here's a base R version using strcapture:
ns <- max(nchar(dat$Symbol))
cbind(
  dat,
  strcapture(
    paste(rep("(.)", ns), collapse=""),
    format(dat$Symbol, width=ns),
    proto=setNames(rep(list(""), ns), paste0("Sy",1:ns))
  )
)
A late base R addition using substring, which is vectorized over each of its inputs, including the start and end positions of each substring:
dat[paste0("Sy",seq(ns))] <- matrix(substring(rep(dat$Symbol,each=ns),
seq(ns), seq(ns)), ncol=ns, byrow=TRUE)
# Name S1 S2 S3 Symbol Sy1 Sy2 Sy3 Sy4 Sy5
#1 n_12 2.3 6.1 0 A A
#2 n_13 3.4 3.7 0 ACM A C M
#3 n_14 1.3 1.0 0 BN B N
#4 n_23 2.0 4.1 0 NOPXY N O P X Y
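The recycling behaviour that the substring approach relies on can be seen in isolation. This is a small sketch with made-up variable names (full, short, grid), not code from the answer:

```r
# substring() recycles its start and end arguments over the text
full <- substring("NOPXY", 1:5, 1:5)    # one character per position

# Positions past the end of the string yield "" rather than NA
short <- substring("BN", 1:5, 1:5)

# Repeating each string ns times and filling by row gives one row per string
symbols <- c("A", "ACM")
ns <- max(nchar(symbols))
grid <- matrix(substring(rep(symbols, each = ns), seq(ns), seq(ns)),
               ncol = ns, byrow = TRUE)
```

This is why the printed result shows blank cells instead of NA for the shorter symbols.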
Upvotes: 3
Reputation: 12155
One way to do this is with tidyr::separate, which splits a single column containing a string into multiple columns containing substrings.
df
Name S1 S2 S3 Symbol
1 n_12 2.3 6.1 0 A
2 n_13 3.4 3.7 0 ACM
3 n_14 1.3 1.0 0 BN
4 n_23 2.0 4.1 0 NOPXY
The sep= argument for separate accepts either a regex or a numeric vector listing the positions in the string to split on. Since we want to split after every character, we give it a numeric sequence from 1 to the length of the longest string minus 1 (we don't need to split after the last character). The length of the longest string is calculated with max(nchar(.$Symbol)). Thanks to Rich Scriven for pointing out that nchar is vectorized and so doesn't need to be called with sapply.
We then make a character vector with the names of the columns to split Symbol into. In your case, we can just paste 'Sy' to that same numeric sequence to get c('Sy1', 'Sy2', ...):
library(magrittr)  # provides the %>% pipe (also attached by dplyr)
df %>%
  tidyr::separate(Symbol,
                  sep = seq_len(max(nchar(.$Symbol)) - 1),
                  into = paste0('Sy', seq_len(max(nchar(.$Symbol)))))
Name S1 S2 S3 Sy1 Sy2 Sy3 Sy4 Sy5
1 n_12 2.3 6.1 0 A
2 n_13 3.4 3.7 0 A C M
3 n_14 1.3 1.0 0 B N
4 n_23 2.0 4.1 0 N O P X Y
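For reference, the two vectors passed to sep= and into= can be computed on their own. This is a base R sketch on the Symbol values from the question; the names symbols, longest, split_after and new_names are only illustrative:

```r
# Symbol values copied from the question (illustrative input)
symbols <- c("A", "ACM", "BN", "NOPXY")

longest <- max(nchar(symbols))               # length of the longest string

split_after <- seq_len(longest - 1)          # positions to split after
new_names   <- paste0("Sy", seq_len(longest)) # one name per resulting column
```

There is always one more column name than split position, since n - 1 cuts produce n pieces.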
If you get the following error:
Error in nchar(.$Symbol) : 'nchar()' requires a character vector
then it is likely that df$Symbol is of type factor (the default for strings when creating or loading a data.frame in R versions before 4.0.0), not character.
You can either pass stringsAsFactors = FALSE to read.table or data.frame to keep the Symbol variable from being converted to factor, or convert it back to character.
Tidyverse option (which can be inserted into the pipe just before the call to tidyr::separate):
df <- df %>%
dplyr::mutate(Symbol = as.character(Symbol))
or with base R:
df$Symbol <- as.character(df$Symbol)
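The factor problem can be reproduced in a few lines. This sketch forces a factor column with stringsAsFactors = TRUE so that it behaves the same on any R version; df_factor is a made-up name for illustration:

```r
# Build a small data frame whose Symbol column is deliberately a factor
df_factor <- data.frame(Symbol = c("A", "ACM"), stringsAsFactors = TRUE)
is_fct <- is.factor(df_factor$Symbol)

# nchar() refuses factors; capture the error message instead of stopping
err <- tryCatch(nchar(df_factor$Symbol),
                error = function(e) conditionMessage(e))

# After converting to character, nchar() works as expected
df_factor$Symbol <- as.character(df_factor$Symbol)
lens <- nchar(df_factor$Symbol)
```

Checking is.factor() or class() on the column first is a quick way to confirm whether the conversion is needed.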
Upvotes: 9
Reputation: 61154
Here's a base R approach using brute force:
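# Split every symbol into single characters
string <- strsplit(df$Symbol, "")
# Length of the longest symbol, i.e. the number of new columns
ind <- max(lengths(string))
out <- data.frame(df, do.call(rbind, lapply(string, function(x) {
  if(length(x) != ind){
    # Indexing past the end of x yields NA, which pads short symbols
    c(x[1:length(x)], x[(length(x)+1):ind] )
  }else{
    x
  }
})))
# Rename the automatically generated X1, X2, ... columns to Sy1, Sy2, ...
names(out) <- sub("X(\\d)", "Sy\\1", names(out))
print(out, na.print = "")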
Name S1 S2 S3 Symbol Sy1 Sy2 Sy3 Sy4 Sy5
1 n_12 2.3 6.1 0 A A
2 n_13 3.4 3.7 0 ACM A C M
3 n_14 1.3 1.0 0 BN B N
4 n_23 2.0 4.1 0 NOPXY N O P X Y
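The padding step can also be written more compactly, because indexing a vector beyond its length returns NA. This is an alternative sketch and not the code above; symbols, chars, width and padded are illustrative names:

```r
# Symbol values copied from the question (illustrative input)
symbols <- c("A", "ACM", "BN", "NOPXY")

chars <- strsplit(symbols, "")     # list of single-character vectors
width <- max(lengths(chars))       # length of the longest symbol

# Index every vector with 1..width; missing positions become NA.
# sapply() returns one column per symbol, so transpose to get one row each.
padded <- t(sapply(chars, `[`, seq_len(width)))
colnames(padded) <- paste0("Sy", seq_len(width))
```

The resulting character matrix can then be attached with cbind() or data.frame(), as in the answer above.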
Upvotes: 1