Antti

Reputation: 1293

How to create combinatory variable in R data.frame?

I have a data.frame in which several variables contain zero values. I need to construct an extra variable that, for each observation, gives the combination of variables that are not zero. For example:

df <- data.frame(firm = c("firm1", "firm2", "firm3", "firm4", "firm5"),
                 A = c(0, 0, 0, 1, 2),
                 B = c(0, 1, 0, 42, 0),
                 C = c(1, 1, 0, 0, 0))

Now I would like to generate the new variable:

df$varCombination <- c("C", "B-C", NA, "A-B", "A")

I thought up something like this, which obviously did not work:

for (i in 1:nrow(df)){
    df$varCombination[i] <- paste(names(df[i,2:ncol(df) & > 0]), collapse = "-")
}

Upvotes: 3

Views: 119

Answers (3)

David Arenburg

Reputation: 92282

This could probably be solved easily using apply(df, 1, fun), but here is an attempt to solve it column-wise instead of row-wise, for the sake of performance (I once saw something similar done by @alexis_laz but can't find it right now):

## Create a logical matrix
tmp <- df[-1] != 0
## or tmp <- sapply(df[-1], `!=`, 0)

## Preallocate the result
res <- rep(NA, nrow(tmp))

## Run per column instead of per row
for(j in colnames(tmp)){
  res[tmp[, j]] <- paste(res[tmp[, j]], j, sep = "-")
}

## Strip the "NA-" prefix that the pre-allocated `NA` left on the filled-in entries
gsub("NA-", "", res, fixed = TRUE)
# [1] "C"   "B-C" NA    "A-B" "A"
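
One caveat with the gsub step: with fixed = TRUE it removes every "NA-" it finds, so a column whose name ends in "NA" (say "DNA") would be mangled. A minimal sketch of a variant that skips the cleanup step, assuming the data contains no NA values; the `hit` helper name is mine, not part of the original answer:

```r
df <- data.frame(firm = c("firm1", "firm2", "firm3", "firm4", "firm5"),
                 A = c(0, 0, 0, 1, 2),
                 B = c(0, 1, 0, 42, 0),
                 C = c(1, 1, 0, 0, 0))

## Logical matrix: TRUE where a value is non-zero
tmp <- df[-1] != 0

## Start from character NA so no "NA-" prefix is ever pasted in
res <- rep(NA_character_, nrow(tmp))

for (j in colnames(tmp)) {
  hit <- tmp[, j]  # rows where column j is non-zero
  ## First name for a row is used as-is; later names are appended with "-"
  res[hit] <- ifelse(is.na(res[hit]), j, paste(res[hit], j, sep = "-"))
}

res
```

It is still one pass per column, so it should scale much like the version above, though I have not benchmarked it.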

Some benchmarks on a bigger data set

set.seed(123)
BigDF <- as.data.frame(matrix(sample(0:1, 1e4, replace = TRUE), ncol = 10))

library(microbenchmark)

MM <- function(df) {
  var_names <- names(df)[-1]
  res <- character(nrow(df))
  for (i in 1:nrow(df)){
    non_zero_names <- var_names[df[i, -1] > 0]
    res[i] <- paste(non_zero_names, collapse  = '-')
  }
  res
}

ZX <- function(df) {
  res <- 
    apply(df[,2:ncol(df)]>0, 1,
          function(i)paste(colnames(df[, 2:ncol(df)])[i], collapse = "-"))
  res[res == ""] <- NA
  res
}

DA <- function(df) {
  tmp <- df[-1] != 0
  res <- rep(NA, nrow(tmp))

  for(j in colnames(tmp)){
    res[tmp[, j]] <- paste(res[tmp[, j]], j, sep = "-")
  }
  gsub("NA-", "", res, fixed = TRUE)
}


microbenchmark(MM(BigDF), ZX(BigDF), DA(BigDF))
# Unit: milliseconds
#      expr       min         lq       mean     median         uq        max neval cld
# MM(BigDF) 239.36704 248.737408 253.159460 252.177439 255.144048 289.340528   100   c
# ZX(BigDF)  35.83482  37.617473  38.295425  38.022897  38.357285  76.619853   100  b 
# DA(BigDF)   1.62682   1.662979   1.734723   1.735296   1.761695   2.725659   100 a  
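
Before reading too much into the timings, it may be worth checking that the three functions return the same thing. A rough sketch, with the definitions copied from the benchmark above (only seq_len() swapped in for 1:nrow() in MM); the `all_same` name is just for illustration:

```r
set.seed(123)
BigDF <- as.data.frame(matrix(sample(0:1, 1e4, replace = TRUE), ncol = 10))

MM <- function(df) {
  var_names <- names(df)[-1]
  res <- character(nrow(df))
  for (i in seq_len(nrow(df))) {
    res[i] <- paste(var_names[df[i, -1] > 0], collapse = "-")
  }
  res
}

ZX <- function(df) {
  res <- apply(df[, 2:ncol(df)] > 0, 1,
               function(i) paste(colnames(df[, 2:ncol(df)])[i], collapse = "-"))
  res[res == ""] <- NA
  res
}

DA <- function(df) {
  tmp <- df[-1] != 0
  res <- rep(NA, nrow(tmp))
  for (j in colnames(tmp)) {
    res[tmp[, j]] <- paste(res[tmp[, j]], j, sep = "-")
  }
  gsub("NA-", "", res, fixed = TRUE)
}

mm <- MM(BigDF)
mm[mm == ""] <- NA        # MM returns "" for all-zero rows, the others NA
zx <- unname(ZX(BigDF))   # drop any row names that apply() carries along
da <- DA(BigDF)

all_same <- identical(mm, da) && identical(zx, da)
all_same
```

Note that MM and ZX test `> 0` while DA tests `!= 0`; on this 0/1 data the two conditions coincide, but they would not if negative values were present.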

Upvotes: 6

Mhairi McNeill

Reputation: 1981

You had the right idea, but the logical comparison in your loop wasn't correct.

I've tried to keep the code fairly similar to what you had; this should work:

var_names <- names(df)[-1]

df$varCombination <- character(nrow(df))

for (i in 1:nrow(df)){

  non_zero_names <- var_names[df[i, -1] > 0]

  df$varCombination[i] <- paste(non_zero_names, collapse  = '-')

}

> df
   firm A  B C varCombination
1 firm1 0  0 1              C
2 firm2 0  1 1            B-C
3 firm3 0  0 0               
4 firm4 1 42 0            A-B
5 firm5 2  0 0              A
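
Note that firm3 ends up as an empty string here rather than the NA asked for in the question. A small sketch of how the same loop could be adjusted, assuming the value columns are exactly A, B and C; the `combo` vector is a name I introduced, and the columns are selected by name so the comparison never touches the new character column:

```r
df <- data.frame(firm = c("firm1", "firm2", "firm3", "firm4", "firm5"),
                 A = c(0, 0, 0, 1, 2),
                 B = c(0, 1, 0, 42, 0),
                 C = c(1, 1, 0, 0, 0))

var_names <- c("A", "B", "C")

## Build the result in a separate vector, then attach it at the end
combo <- character(nrow(df))

for (i in seq_len(nrow(df))) {
  ## unlist() turns the one-row data.frame into a plain numeric vector
  non_zero_names <- var_names[unlist(df[i, var_names]) > 0]
  combo[i] <- paste(non_zero_names, collapse = "-")
}

## Rows with no non-zero variable come out as "", so convert those to NA
combo[combo == ""] <- NA
df$varCombination <- combo

df
```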

Upvotes: 1

zx8754

Reputation: 56004

Using apply:

# paste column names
df$varCombination <- 
  apply(df[,2:ncol(df)]>0, 1,
        function(i)paste(colnames(df[, 2:ncol(df)])[i], collapse = "-"))

# convert blank to NA
df$varCombination[df$varCombination == ""] <- NA

# result
df
#    firm A  B C varCombination
# 1 firm1 0  0 1              C
# 2 firm2 0  1 1            B-C
# 3 firm3 0  0 0           <NA>
# 4 firm4 1 42 0            A-B
# 5 firm5 2  0 0              A
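
One thing to keep in mind: the question asks for variables that are "not zero", while this code tests `> 0`. For the example data the two are the same, but they differ once negative values appear. A tiny illustration on a made-up row (the `firm6` data is hypothetical, not from the question):

```r
## Hypothetical observation with a negative value in A
x <- data.frame(firm = "firm6", A = -1, B = 0, C = 3)
cols <- names(x)[-1]
vals <- unlist(x[-1])

positive_only <- paste(cols[vals > 0], collapse = "-")   # skips the negative A
non_zero      <- paste(cols[vals != 0], collapse = "-")  # keeps the negative A

positive_only
non_zero
```

If negative values can occur and should count, swapping `> 0` for `!= 0` in the apply call is probably the simplest change.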

Upvotes: 5
