Reputation: 735

R all combinations of dafaframe based on ifelse

I am struggling with the following problem here: I have a dataframe that looks like this:

  aa<-c(0,0,0,1,1,0,0)
  bb<-c(1,1,0,0,1,0,1)
  cc<-c(0,1,0,0,0,1,0)

  d<-data.frame(aa,bb,cc)

The data is always binary and codes for absence/presence data. What I would like to have are new columns with all possible combinations of the variables meeting certain assumptions. For this dataframe it would be like

d$aabb<-ifelse(d$aa=="1"&d$bb=="1"&d$cc=="0",1,0) #aa=1,bb=1,cc=0
d$aacc<-ifelse(d$aa=="1"&d$cc=="1"&d$bb=="0",1,0) #aa=1,bb=0,cc=1
d$bbcc<-ifelse(d$bb=="1"&d$cc=="1"&d$aa=="0",1,0) #aa=0,bb=1,cc=0
d$daabbcc<-ifelse(d$aa=="1"&d$bb=="1"&d$cc=="1",1,0) #aa=bb==cc=1

However, I have 30 Columns and I dont want to fill them all out by hand. Another nice thing would be if the resulting column names are a combination of the original ones (aa+bb->aabb), etc.

I looked at the expand.grid() function but this was not what I was looking for Thanks in advance

Upvotes: 3

Answers (4)

Thell

Reputation: 5958

A possible solution using the sets package...

Setup as given:

aa <- c(0, 0, 0, 1, 1, 0, 0)
bb <- c(1, 1, 0, 0, 1, 0, 1)
cc <- c(0, 1, 0, 0, 0, 1, 0)

d <- data.frame(aa, bb, cc)

And prep environment...

require(sets, quietly = T)
require(data.table, quietly = T)

Create a unique listing of names in 'set' order by creating a set of sets from d.

# Created as a list so that duplicates are kept.
namesets <- sapply(seq_len(nrow(d)), function(i) {
    gset(colnames(d), memberships = d[i, ])
})

# Then combine the set memberships into names and assign to the sets.
setnames <- sapply(namesets, function(s) {
    ifelse(set_is_empty(s), "none", paste(as.character(s), collapse = ""))
})
names(namesets) <- setnames

# Creating set of sets from namesets orders the names and removes duplicates.
namesets <- as.set(namesets)

print(namesets)

## {none = {}, aa = {"aa"}, bb = {"bb"}, cc = {"cc"}, aabb = {"aa",
##  "bb"}, bbcc = {"bb", "cc"}}

# Making it easy to create an ordered listing that we can use as a key.
setnames <- ordered(setnames, levels = names(namesets))
print(setnames)

## [1] bb   bbcc none aa   aabb cc   bb  
## Levels: none < aa < bb < cc < aabb < bbcc

Converting d into a data.table we can then populate member-set columns in various ways...

# First a simple membership to key-by
dt <- data.table(membership = setnames, d, key = "membership")
print(dt)

##    membership aa bb cc
## 1:       none  0  0  0
## 2:         aa  1  0  0
## 3:         bb  0  1  0
## 4:         bb  0  1  0
## 5:         cc  0  0  1
## 6:       aabb  1  1  0
## 7:       bbcc  0  1  1


# That might be enough for some, but the OP wants columns
# indicating a membership; so just join a matrix...
membership.map <- t(sapply(dt$membership, function(m) {
    m == levels(dt$membership)
}) * 1)
colnames(membership.map) <- levels(dt$membership)

dt <- cbind(dt, split = " ==> ", membership.map)

print(dt)

##    membership aa bb cc split none aa bb cc aabb bbcc
## 1:       none  0  0  0  ==>     1  0  0  0    0    0
## 2:         aa  1  0  0  ==>     0  1  0  0    0    0
## 3:         bb  0  1  0  ==>     0  0  1  0    0    0
## 4:         bb  0  1  0  ==>     0  0  1  0    0    0
## 5:         cc  0  0  1  ==>     0  0  0  1    0    0
## 6:       aabb  1  1  0  ==>     0  0  0  0    1    0
## 7:       bbcc  0  1  1  ==>     0  0  0  0    0    1

This can all be wrapped up in a quick and dirty function like so:

membership.table <- function(df) {

    namesets <- sapply(seq_len(nrow(d)), function(i) {
        gset(colnames(d), memberships = d[i, ])
    })

    setnames <- sapply(namesets, function(s) {
        ifelse(set_is_empty(s), "none", paste(as.character(s), collapse = ""))
    })
    names(namesets) <- setnames

    namesets <- as.set(namesets)
    setnames <- ordered(setnames, levels = names(namesets))

    dt <- data.table(membership = setnames, d, key = "membership")
    membership.map <- t(sapply(dt$membership, function(m) {
        m == levels(dt$membership)
    }) * 1)
    colnames(membership.map) <- levels(dt$membership)

    cbind(dt, split = " ==> ", membership.map)
}



mt <- membership.table(d)
identical(dt, mt)

## [1] TRUE

And we should now getting matching results when summarizing the membership table by-key and the membership information when creating generalized set from the original data.

mt[, lapply(.SD, sum), by = membership, .SDcols = seq(3 + ncol(d), ncol(mt))]

##    membership none aa bb cc aabb bbcc
## 1:       none    1  0  0  0    0    0
## 2:         aa    0  1  0  0    0    0
## 3:         bb    0  0  2  0    0    0
## 4:         cc    0  0  0  1    0    0
## 5:       aabb    0  1  1  0    1    0
## 6:       bbcc    0  0  1  1    0    1


as.list(as.gset(d))

## $`3`
## (aa = 0, bb = 0, cc = 0)
## 
## $`6`
## (aa = 0, bb = 0, cc = 1)
## 
## $`1`
## (aa = 0, bb = 1, cc = 0)
## 
## $`2`
## (aa = 0, bb = 1, cc = 1)
## 
## $`4`
## (aa = 1, bb = 0, cc = 0)
## 
## $`5`
## (aa = 1, bb = 1, cc = 0)
## 
## attr(,"memberships")
## 
## 1 2 3 4 5 6 
## 1 1 2 1 1 1

Notice that bb has a sum of 2 in the membership table, and the third item in the generalized set list (indicating bb) also shows 2 such sets.

If this same algorithm is applied to Hong's example then the results are:

##     membership a b c d e f split a bc ce abd acd ade abef acdef abcdef
##  1:          a 1 0 0 0 0 0  ==>  1  0  0   0   0   0    0     0      0
##  2:         bc 0 1 1 0 0 0  ==>  0  1  0   0   0   0    0     0      0
##  3:         ce 0 0 1 0 1 0  ==>  0  0  1   0   0   0    0     0      0
##  4:        abd 1 1 0 1 0 0  ==>  0  0  0   1   0   0    0     0      0
##  5:        acd 1 0 1 1 0 0  ==>  0  0  0   0   1   0    0     0      0
##  6:        ade 1 0 0 1 1 0  ==>  0  0  0   0   0   1    0     0      0
##  7:       abef 1 1 0 0 1 1  ==>  0  0  0   0   0   0    1     0      0
##  8:      acdef 1 0 1 1 1 1  ==>  0  0  0   0   0   0    0     1      0
##  9:     abcdef 1 1 1 1 1 1  ==>  0  0  0   0   0   0    0     0      1
## 10:     abcdef 1 1 1 1 1 1  ==>  0  0  0   0   0   0    0     0      1

While this solution does more (like sorting and ordering) the timing isn't too horrible compared to Hong's solution; but compared to Thomas's...

## Unit: milliseconds
##        expr     min      lq  median      uq     max neval
##          hf 241.810 246.411 253.634 262.544 290.345    10
##          mt 128.105 137.931 142.966 154.244 210.276    10
##          tf   1.754   1.768   1.806   2.312   3.487    10
##  plain.gset   1.220   1.330   1.386   1.475   1.644    10

... both solutions are slow. And without a doubt, if you just need to work with the sets then perhaps a little time in the sets vignette's would be worthwhile for larger memberships.

Upvotes: 1

Hong Ooi

Reputation: 57686

Regardless of its applicability to an actual problem, this was kind of an interesting programming exercise. Here's code to create all 63 (=2^6 - 1) possible combinations from 6 columns, excluding the null. (As an aside, I don't see what's unclear about the question; it says "all possible combinations" in the second sentence, and one of the variables created in the sample code is all zeros (d$aabbcc.))

# create the source data
d <- data.frame(matrix(rbinom(60, 1, 0.5), ncol=6))
names(d) <- letters[1:6]


# generate matrix of all possible combinations (except the null)
v <- as.matrix(expand.grid(rep(list(c(FALSE, TRUE)), ncol(d))))[-1, ]

# convert the matrix into a list of column indexes
indexes <- lapply(seq_len(nrow(v)), function(x) v[x, ])
names(indexes) <- apply(v, 1, function(x) paste(names(d)[x], collapse="."))

# compute values from the source data
out <- data.frame(lapply(indexes, function(i) as.numeric(apply(d[i], 1, all))))

There's some unnecessary computations going on, most obviously in how later combinations don't reuse the values from earlier ones. Still, this takes a fraction of a second even with 1000 rows, and only a few seconds with 100000 rows. Seeing as the problem is only feasible for a small number of columns, I didn't think further optimisation was worth the trouble.

Upvotes: 2

Carl Witthoft

Reputation: 21492

Since all data are binary,aka logical, why not convert each potential combination into a number (zero thru 2^N), then, similar to @Thomas answer, convert each row in the dataframe to a single binary sequence, and then your new columns will simply be row_value[j] == column_numeric_value[k] (cheap pseudocode). That is, for a simple 3-column input, there are 8 possible outputs. If row[j] is 1 0 1 then row_value[j] is decimal "5", and row_value[j] == column_numeric_value[5] is true, and is false for all other columns.

Upvotes: 0

Thomas

Reputation: 44525

Some data:

aa<-c(0,0,0,1,1,0,0)
bb<-c(1,1,0,0,1,0,1)
cc<-c(0,1,0,0,0,1,0)
dd<-rbinom(7,1,.5)
ee<-rbinom(7,1,.5)
ff<-rbinom(7,1,.5)
d<-data.frame(aa,bb,cc,dd,ee,ff)

Create a variable that is all possible combinations of the values:

combinations <- apply(d,1,function(x) paste(names(d)[as.logical(x)],collapse=""))

Convert that variable into a set of named variables and bind the results to d:

d2 <- sapply(unique(combinations), function(x) as.numeric(combinations==x))

Prevent duplicated column names when only one value is present in the original df:

colnames(d2) <- paste0(colnames(d2),"1") # could be any naming convention
d2 <- cbind(d, d2)

Upvotes: 4

R all combinations of dafaframe based on ifelse

Answers (4)

A possible solution using the sets package...

Related Questions