Jason Born

Reputation: 39

Select column names with specific values

I have a data set with 500+ columns and 567800 rows. Each column holds small integer values such as 0, 1, 2, 3, etc.

I want to create a separate column that pastes together the names of the columns whose values are not equal to 0.

I tried running a loop in R, but I am not getting the desired result and it takes a very long time. My data looks like this:

DOC_ID  my_name your_name his_name our_name.....546 columns
  1        0         1         0        5   
  2        5         0         2        6   
  3        0         1         5        0
  ....

Desired Output is :

DOC_ID  my_name your_name his_name our_name..     paste_column
  1        0         1         0        5       your_name our_name
  2        5         0         2        6       my_name his_name our_name
  3        0         1         5        0       your_name his_name
  ...

Please suggest faster, workable code, as apply is taking a lot of time to produce the output.

Upvotes: 1

Views: 1489

Answers (4)

Uwe

Reputation: 42544

You can try this

library(data.table)
library(stringr)

set.seed(1234)
nc <- 500L
nr <- 100000L

### SETTING UP DUMMY DATA
# create data.table 
dt <- data.table(id = seq_len(nr))
# make space for additional columns
alloc.col(dt, 1L + nc)
# create new columns and fill with dummy data
for (jc in seq_len(nc)) {
  set(dt, j = paste0("V", jc), value = sample(0:5, nr, replace = TRUE))
}

### PROPOSED SOLUTION
# reshape from wide to long
mt <- melt(dt, id.vars = "id")
# keep names of non zero columns
mt[value == 0, variable := ""]
# remove value column to save RAM
mt[, value := NULL]
# aggregate by id
ct <- mt[, .(nonzerocols = str_c(variable, collapse = " ")), keyby = id]
# clean up blanks
ct[, nonzerocols := str_trim(str_replace_all(nonzerocols, "[ ]{2,}", " "))]
# add new column to original data.table (merge)
dt[ct, on = "id"]

This example with 500 cols x 100'000 rows took about 12 secs elapsed time on my PC and required 770 MB of RAM (according to tables()). So, memory consumption of the molten data could be an issue.
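To see the shape of the result, here is the same melt/aggregate idea on a toy table matching the question's example (hypothetical data; paste() is used instead of str_c() so the factor column created by melt() is coerced automatically, and the zero rows are filtered out rather than blanked):

```r
library(data.table)

# toy data shaped like the question's example
dt2 <- data.table(id = 1:3,
                  my_name   = c(0L, 5L, 0L),
                  your_name = c(1L, 0L, 1L),
                  his_name  = c(0L, 2L, 5L),
                  our_name  = c(5L, 6L, 0L))

mt2 <- melt(dt2, id.vars = "id")   # wide -> long: id, variable, value
mt2 <- mt2[value != 0]             # keep only the non-zero rows
ct2 <- mt2[, .(paste_column = paste(variable, collapse = " ")), keyby = id]
dt2[ct2, on = "id"]
# paste_column: "your_name our_name", "my_name his_name our_name", "your_name his_name"
```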

Edit

Sotos has suggested a clever one-liner using apply from base R. On the other hand, the OP had claimed in his question that apply is taking a lot of time to give output. So, I did a benchmark of

  1. my answer using data.table and reshape
  2. an improved version of the reshape approach
  3. Sotos' answer using apply

For a fair comparison with Sotos's answer, the final merge in my answer is skipped.

The improved version of the reshape approach removes all zero rows from the molten data.table. This should become more efficient the more zeros are included in the data.

reshape2 <- function(dt) {
  # reshape from wide to long
  mt <- melt(dt, id.vars = "id")
  # keep names only of non zero columns => remove rows with zeros
  mt <- mt[value != 0]
  # remove value column to save RAM
  mt[, value := NULL]
  # aggregate by id
  ct <- mt[, .(nonzerocols = str_c(variable, collapse = " ")), keyby = id]
  return(ct)
}

Benchmarking is done with help of the microbenchmark package:

microbenchmark(
  r1 <- reshape1(dt),  # the original melt approach above, without the final merge
  r2 <- reshape2(dt),
  r3 <- sotos(dt),     # Sotos' apply one-liner, wrapped in a function
  times = 3)

With 500 cols x 100'000 rows the results were:

Unit: seconds
               expr      min       lq     mean   median       uq      max neval
 r1 <- reshape1(dt) 8.989972 9.110466 9.193764 9.230960 9.295660 9.360360     3
 r2 <- reshape2(dt) 4.727613 5.131648 5.300693 5.535683 5.587234 5.638785     3
    r3 <- sotos(dt) 7.528728 7.858817 8.020705 8.188905 8.266693 8.344480     3

For large numbers of columns and rows, the improved reshape approach shows the best performance. However, Sotos' concise one-liner is more efficient in terms of memory consumption. I was able to run a 500 cols x 500'000 rows example on my PC with 8 GB of RAM which took 44 secs, while my improved reshape approach failed due to lack of memory. So, Sotos gets my +1.

Upvotes: 2

rafa.pereira

Reputation: 13807

library(data.table)


setDT(df)[, paste_column := paste(names(df[, colMeans(df > 0) == 1, with = FALSE]), collapse = ' ')]

here is a reproducible example:

dt <- data.table(V1=1:5, V2=3:7, V3=7:3)

dt[, paste_column := paste(names(dt[, colMeans(dt > 2) == 1, with = FALSE]), collapse = ' ')]

>    V1 V2 V3 paste_column
> 1:  1  3  7        V2 V3
> 2:  2  4  6        V2 V3
> 3:  3  5  5        V2 V3
> 4:  4  6  4        V2 V3
> 5:  5  7  3        V2 V3

Upvotes: 0

noumenal

Reputation: 1237

You probably want to ignore the zeros and treat them as missing values:

mytext1[mytext1==0] <- NA

Then, just run commands with

na.rm = TRUE

for example

mean(x, na.rm=TRUE)
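For a single column this looks like (x is a hypothetical stand-in for one of the OP's columns):

```r
x <- c(0, 1, 2, 0, 3)
x[x == 0] <- NA           # recode zeros as missing
mean(x, na.rm = TRUE)     # 2
```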

Upvotes: 0

Sotos

Reputation: 51582

One way using base R,

apply(df, 1, function(i) paste(names(i[i > 0]), collapse = ' '))

or

apply(df, 1, function(i) paste(names(i[i != 0]), collapse = ' '))
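Note that on the OP's data the DOC_ID column itself is always non-zero, so it would be pasted into every row; a sketch that drops it first (hypothetical data frame):

```r
df <- data.frame(DOC_ID = 1:3,
                 my_name = c(0, 5, 0), your_name = c(1, 0, 1),
                 his_name = c(0, 2, 5), our_name = c(5, 6, 0))

# apply over the rows of everything except the first (DOC_ID) column
df$paste_column <- apply(df[-1], 1, function(i)
  paste(names(i)[i != 0], collapse = " "))
df$paste_column
# "your_name our_name" "my_name his_name our_name" "your_name his_name"
```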

Upvotes: 3
