Reputation: 6639
I have a bunch of columns in a dataframe which I want to paste together (seperated by "-") as follows:
data <- data.frame('a' = 1:3,
'b' = c('a','b','c'),
'c' = c('d', 'e', 'f'),
'd' = c('g', 'h', 'i'))
i.e.
a b c d
1 a d g
2 b e h
3 c f i
Which I want to become:
a x
1 a-d-g
2 b-e-h
3 c-f-i
I could normally do this with:
within(data, x <- paste(b,c,d,sep='-'))
and then removing the old columns, but unfortunately I do not know the names of the columns specifically, only a collective name for all of the columns, e.g. I would know that cols <- c('b','c','d')
Does anyone know a way of doing this?
Upvotes: 132
Views: 168888
Reputation: 1648
Using tidyr
package, this can be easily handled in 1 function call.
data <- data.frame('a' = 1:3,
'b' = c('a','b','c'),
'c' = c('d', 'e', 'f'),
'd' = c('g', 'h', 'i'))
Exclude first column, everything else gets pasted.
# tidyr_0.6.3
unite(data, newCol, -a)
# or by column index unite(data, newCol, -1)
# a newCol
# 1 1 a_d_g
# 2 2 b_e_h
# 3 3 c_f_i
Upvotes: 56
Reputation: 1329
Simple and straightforward code with unite
from {tidyr} v1.2.0
{tidyr v1.2.0}
library(tidyr)
data %>% unite("x", all_of(cols), remove = T, sep = "-")
"x"
is the name of the new column.all_of(cols)
is a selection of what columns we want to merge. Using <tidy-select>
the column names don't need to be hardcoded in.remove = T
we remove the input columnssep = "-"
we define the separator between valuesNA
, we can also add na.rm = TRUE
# a x
# 1 1 a-d-g
# 2 2 b-e-h
# 3 3 c-f-i
data <- data.frame('a' = 1:3,
'b' = c('a','b','c'),
'c' = c('d', 'e', 'f'),
'd' = c('g', 'h', 'i'))
cols <- c('b','c','d')
data
# a b c d
# 1 1 a d g
# 2 2 b e h
# 3 3 c f i
*This solution is different from what already posted.
Upvotes: 7
Reputation: 196
I know this is an old question, but thought that I should anyway present the simple solution using the paste() function as suggested to by the questioner:
data_1<-data.frame(a=data$a,"x"=paste(data$b,data$c,data$d,sep="-"))
data_1
a x
1 1 a-d-g
2 2 b-e-h
3 3 c-f-i
Upvotes: 2
Reputation: 193517
Here's a fairly unconventional (but fast) approach: use fwrite
from data.table
to "paste" the columns together, and fread
to read it back in. For convenience, I've written the steps as a function called fpaste
:
fpaste <- function(dt, sep = ",") {
x <- tempfile()
fwrite(dt, file = x, sep = sep, col.names = FALSE)
fread(x, sep = "\n", header = FALSE)
}
Here's an example:
d <- data.frame(a = 1:3, b = c('a','b','c'), c = c('d','e','f'), d = c('g','h','i'))
cols = c("b", "c", "d")
fpaste(d[cols], "-")
# V1
# 1: a-d-g
# 2: b-e-h
# 3: c-f-i
How does it perform?
d2 <- d[sample(1:3,1e6,TRUE),]
library(microbenchmark)
microbenchmark(
docp = do.call(paste, c(d2[cols], sep="-")),
tidr = tidyr::unite_(d2, "x", cols, sep="-")$x,
docs = do.call(sprintf, c(d2[cols], '%s-%s-%s')),
appl = apply( d2[, cols ] , 1 , paste , collapse = "-" ),
fpaste = fpaste(d2[cols], "-")$V1,
dt2 = as.data.table(d2)[, x := Reduce(function(...) paste(..., sep = "-"), .SD), .SDcols = cols][],
times=10)
# Unit: milliseconds
# expr min lq mean median uq max neval
# docp 215.34536 217.22102 220.3603 221.44104 223.27224 225.0906 10
# tidr 215.19907 215.81210 220.7131 220.09636 225.32717 229.6822 10
# docs 281.16679 285.49786 289.4514 286.68738 290.17249 312.5484 10
# appl 2816.61899 3106.19944 3259.3924 3266.45186 3401.80291 3804.7263 10
# fpaste 88.57108 89.67795 101.1524 90.59217 91.76415 197.1555 10
# dt2 301.95508 310.79082 384.8247 316.29807 383.94993 874.4472 10
Upvotes: 6
Reputation: 83215
In my opinion the sprintf
-function deserves a place among these answers as well. You can use sprintf
as follows:
do.call(sprintf, c(d[cols], '%s-%s-%s'))
which gives:
[1] "a-d-g" "b-e-h" "c-f-i"
And to create the required dataframe:
data.frame(a = d$a, x = do.call(sprintf, c(d[cols], '%s-%s-%s')))
giving:
a x
1 1 a-d-g
2 2 b-e-h
3 3 c-f-i
Although sprintf
doesn't have a clear advantage over the do.call
/paste
combination of @BrianDiggs, it is especially usefull when you also want to pad certain parts of desired string or when you want to specify the number of digit. See ?sprintf
for the several options.
Another variant would be to use pmap
from purrr:
pmap(d[2:4], paste, sep = '-')
Note: this pmap
solution only works when the columns aren't factors.
A benchmark on a larger dataset:
# create a larger dataset
d2 <- d[sample(1:3,1e6,TRUE),]
# benchmark
library(microbenchmark)
microbenchmark(
docp = do.call(paste, c(d2[cols], sep="-")),
appl = apply( d2[, cols ] , 1 , paste , collapse = "-" ),
tidr = tidyr::unite_(d2, "x", cols, sep="-")$x,
docs = do.call(sprintf, c(d2[cols], '%s-%s-%s')),
times=10)
results in:
Unit: milliseconds
expr min lq mean median uq max neval cld
docp 214.1786 226.2835 297.1487 241.6150 409.2495 493.5036 10 a
appl 3832.3252 4048.9320 4131.6906 4072.4235 4255.1347 4486.9787 10 c
tidr 206.9326 216.8619 275.4556 252.1381 318.4249 407.9816 10 a
docs 413.9073 443.1550 490.6520 453.1635 530.1318 659.8400 10 b
Used data:
d <- data.frame(a = 1:3, b = c('a','b','c'), c = c('d','e','f'), d = c('g','h','i'))
Upvotes: 10
Reputation: 6104
# your starting data..
data <- data.frame('a' = 1:3, 'b' = c('a','b','c'), 'c' = c('d', 'e', 'f'), 'd' = c('g', 'h', 'i'))
# columns to paste together
cols <- c( 'b' , 'c' , 'd' )
# create a new column `x` with the three columns collapsed together
data$x <- apply( data[ , cols ] , 1 , paste , collapse = "-" )
# remove the unnecessary columns
data <- data[ , !( names( data ) %in% cols ) ]
Upvotes: 128
Reputation: 92282
Just to add additional solution with Reduce
which probably is slower than do.call
but probebly better than apply
because it will avoid the matrix
conversion. Also, instead a for
loop we could just use setdiff
in order to remove unwanted columns
cols <- c('b','c','d')
data$x <- Reduce(function(...) paste(..., sep = "-"), data[cols])
data[setdiff(names(data), cols)]
# a x
# 1 1 a-d-g
# 2 2 b-e-h
# 3 3 c-f-i
Alternatively we could update data
in place using the data.table
package (assuming fresh data)
library(data.table)
setDT(data)[, x := Reduce(function(...) paste(..., sep = "-"), .SD[, mget(cols)])]
data[, (cols) := NULL]
data
# a x
# 1: 1 a-d-g
# 2: 2 b-e-h
# 3: 3 c-f-i
Another option is to use .SDcols
instead of mget
as in
setDT(data)[, x := Reduce(function(...) paste(..., sep = "-"), .SD), .SDcols = cols]
Upvotes: 10
Reputation: 484
I benchmarked the answers of Anthony Damico, Brian Diggs and data_steve on a small sample tbl_df
and got the following results.
> data <- data.frame('a' = 1:3,
+ 'b' = c('a','b','c'),
+ 'c' = c('d', 'e', 'f'),
+ 'd' = c('g', 'h', 'i'))
> data <- tbl_df(data)
> cols <- c("b", "c", "d")
> microbenchmark(
+ do.call(paste, c(data[cols], sep="-")),
+ apply( data[ , cols ] , 1 , paste , collapse = "-" ),
+ tidyr::unite_(data, "x", cols, sep="-")$x,
+ times=1000
+ )
Unit: microseconds
expr min lq mean median uq max neval
do.call(paste, c(data[cols], sep = "-")) 65.248 78.380 93.90888 86.177 99.3090 436.220 1000
apply(data[, cols], 1, paste, collapse = "-") 223.239 263.044 313.11977 289.514 338.5520 743.583 1000
tidyr::unite_(data, "x", cols, sep = "-")$x 376.716 448.120 556.65424 501.877 606.9315 11537.846 1000
However, when I evaluated on my own tbl_df
with ~1 million rows and 10 columns the results were quite different.
> microbenchmark(
+ do.call(paste, c(data[c("a", "b")], sep="-")),
+ apply( data[ , c("a", "b") ] , 1 , paste , collapse = "-" ),
+ tidyr::unite_(data, "c", c("a", "b"), sep="-")$c,
+ times=25
+ )
Unit: milliseconds
expr min lq mean median uq max neval
do.call(paste, c(data[c("a", "b")], sep="-")) 930.7208 951.3048 1129.334 997.2744 1066.084 2169.147 25
apply( data[ , c("a", "b") ] , 1 , paste , collapse = "-" ) 9368.2800 10948.0124 11678.393 11136.3756 11878.308 17587.617 25
tidyr::unite_(data, "c", c("a", "b"), sep="-")$c 968.5861 1008.4716 1095.886 1035.8348 1082.726 1759.349 25
Upvotes: 8
Reputation: 58825
As a variant on baptiste's answer, with data
defined as you have and the columns that you want to put together defined in cols
cols <- c("b", "c", "d")
You can add the new column to data
and delete the old ones with
data$x <- do.call(paste, c(data[cols], sep="-"))
for (co in cols) data[co] <- NULL
which gives
> data
a x
1 1 a-d-g
2 2 b-e-h
3 3 c-f-i
Upvotes: 57
Reputation: 15441
library(plyr)
ldply(apply(data, 1, function(x) data.frame(
x = paste(x[2:4],sep="",collapse="-"))))
# x
#1 a-d-g
#2 b-e-h
#3 c-f-i
# and with just the vector of names you have:
ldply(apply(data, 1, function(x) data.frame(
x = paste(x[c('b','c','d')],sep="",collapse="-"))))
# or equally:
mynames <-c('b','c','d')
ldply(apply(data, 1, function(x) data.frame(
x = paste(x[mynames],sep="",collapse="-"))))
Upvotes: 0
Reputation: 77096
I'd construct a new data.frame:
d <- data.frame('a' = 1:3, 'b' = c('a','b','c'), 'c' = c('d', 'e', 'f'), 'd' = c('g', 'h', 'i'))
cols <- c( 'b' , 'c' , 'd' )
data.frame(a = d[, 'a'], x = do.call(paste, c(d[ , cols], list(sep = '-'))))
Upvotes: 16