Reputation: 39
I have a data set with the 500+ column names and 567800 rows. Each column is having values like 0,1,2,3 etc.
I want to create a separate column with pasting of columnnames where values are not equal to 0.
I tried running loop in R however I am not getting desired result and it is taking a very long time. My data looks like this :
DOC_ID my_name your_name his_name our_name.....546 columns
1 0 1 0 5
2 5 0 2 6
3 0 1 5 0
....
Desired Output is :
DOC_ID my_name your_name his_name our_name.. paste_column
1 0 1 0 5 your_name our_name
2 5 0 2 0 my_name his_name
3 0 1 5 0 your_name his_name
...
Please suggest a faster and workable code as apply is taking a lot of time to give output
Upvotes: 1
Views: 1489
Reputation: 42544
You can try this
library(data.table)
library(stringr)
set.seed(1234)
nc <- 500L
nr <- 100000L
### SETTING UP DUMMY DATA
# create data.table
dt <- data.table(id = seq_len(nr))
# make space for additional columns
alloc.col(dt, 1L + nc)
# create new columns and fill with dummy data
for (jc in seq_len(nc)) {
set(dt, j = paste0("V", jc), value = sample(0:5, nr, replace = TRUE))
}
### PROPOSED SOLUTION
# reshape from wide to long
mt <- melt(dt, id.vars = "id")
# keep names of non zero columns
mt[value == 0, variable := ""]
# remove value column to save RAM
mt[, value := NULL]
# aggregate by id
ct <- mt[, .(nonzerocols = str_c(variable, collapse = " ")), keyby = id]
# clean up blanks
ct[, nonzerocols := str_trim(str_replace_all(nonzerocols, "[ ]{2,}", " "))]
# add new column to original data.table (merge)
dt[ct, on = "id"]
This example with 500 cols x 100'000 rows took about 12 secs elapsed time on my PC and required 770 MB RAM (according to tables()
). So, memory consumption of the molten data could be an issue.
Sotos has suggested a clever one-liner using apply
from base R. On the other hand, the OP had claimed in his question that apply
is taking a lot of time to give output. So, I did a benchmark of
For a fair comparison with Sotos's answer, the final merge in my answer is skipped.
The improved version of the reshape approach removes all zero rows from the molten data.table. This should become more efficient the more zeros are included in the data.
reshape2 <- function(dt) {
# reshape from wide to long
mt <- melt(dt, id.vars = "id")
# keep names only of non zero columns => remove rows with zeros
mt <- mt[value != 0]
# remove value column to save RAM
mt[, value := NULL]
# aggregate by id
ct <- mt[, .(nonzerocols = str_c(variable, collapse = " ")), keyby = id]
return(ct)
}
Benchmarking is done with help of the microbenchmark
package:
microbenchmark(
r1 <- reshape1(dt),
r2 <- reshape2(dt),
r3 <- sotos(dt),
times = 3)
With 500 cols x 100'000 rows the results were:
Unit: seconds
expr min lq mean median uq max neval
r1 <- reshape1(dt) 8.989972 9.110466 9.193764 9.230960 9.295660 9.360360 3
r2 <- reshape2(dt) 4.727613 5.131648 5.300693 5.535683 5.587234 5.638785 3
r3 <- sotos(dt) 7.528728 7.858817 8.020705 8.188905 8.266693 8.344480 3
For large numbers of columns and rows, the improved reshape approach shows the best performance. However, Sotos' concise one-liner is more efficient in terms of memory consumption. I was able to run a 500 cols x 500'000 rows example on my PC with 8 MB of RAM which took 44 secs while my improved reshape approach failed due to lack of memory. So, Sotos gets my +1.
Upvotes: 2
Reputation: 13807
library(data.table)
setDT(df)[ , paste_column := paste( names( df[,colMeans(df > 0)==T, with=F] ), collapse = ' ') ]
here is a reproducible example:
dt <- data.table(V1=1:5, V2=3:7, V3=7:3)
dt[ , paste_column := paste( names( dt[,colMeans(dt>2)==T, with=F] ), collapse = ' ') ]
> V1 V2 V3 paste_column
> 1: 1 3 7 V2 V3
> 2: 2 4 6 V2 V3
> 3: 3 5 5 V2 V3
> 4: 4 6 4 V2 V3
> 5: 5 7 3 V2 V3
Upvotes: 0
Reputation: 1237
You probably want to ignore the zeros and treat them as missing values:
mytext1[mytext1==0] <- NA
Then, just run commands with
na.rm = TRUE
for example
mean(x, na.rm=TRUE)
Upvotes: 0
Reputation: 51582
One way using base R,
apply(df, 1, function(i) paste(names(i[i > 0 ]), collapse = ''))
or
apply(df, 1, function(i) paste(names(i[i != 0 ]), collapse = ''))
Upvotes: 3