Reputation: 4844

Compare intersections between groups specified in first column

Let's say I have a data frame of three columns: the first one specifies the name of a feature (e.g. a color), the second one a group, and the third one whether the feature is present in that group (1) or missing from it (0):

> d<-data.frame(feature=c("red","blue","green","yellow","red","blue","green","yellow"), group=c(rep("a",4),rep("b",4)),is_there=c(0,1,1,0,1,1,1,0))
> d
  feature group is_there
1     red     a        0
2    blue     a        1
3   green     a        1
4  yellow     a        0
5     red     b        1
6    blue     b        1
7   green     b        1
8  yellow     b        0

Now I would like a summary of how many features are present only in group a, how many only in group b, and how many in both groups. Additionally, I need to extract the names of the features present in both groups. How can I do that? I imagine that a function like crossprod might help, but I cannot figure it out.

The output would be something like:

feature 
red     1
blue    2
green   2
yellow  0

or:

feature a b
red     0 1
blue    1 1
green   1 1
yellow  0 0

In any case, I need a better overview of a fairly large data file (the original has hundreds of features in about 10 groups).

Upvotes: 1

Answers (5)

Rich Scriven

Reputation: 99331

It sounds like a table is what you want. First we keep only the rows where the is_there column equals 1 and drop the third column. Then we call table() on that subset.

> ( tab <- table(d[d$is_there == 1, -3]) )
#         group
# feature  a b
#   blue   1 1
#   green  1 1
#   red    0 1
#   yellow 0 0

A table is a matrix-like object. We can operate on it in much the same way we operate on a matrix.

Looking at group a:

> tab[,"a"]                           ## vector of group "a"
#  blue  green    red yellow 
#     1      1      0      0 
> tab[,"a"][ tab[,"a"] > 0 ]          ## present in group "a"
#  blue green 
#     1     1 
> names(tab[,"a"][ tab[,"a"] > 0 ])   ## "feature" present in group "a"
# [1] "blue"  "green"

And the same for group b.
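
The question also asks for the features present in both groups and for counts per category. A minimal sketch of how that could be read off the same table follows; the names n_groups, in_both, only_a and only_b are made up for illustration, and stringsAsFactors = TRUE is set explicitly on the assumption that the all-zero "yellow" row should be kept (recent R versions no longer convert strings to factors by default).

```r
# Same data as in the question; factors keep the empty "yellow" level in the table
d <- data.frame(
  feature  = c("red", "blue", "green", "yellow", "red", "blue", "green", "yellow"),
  group    = rep(c("a", "b"), each = 4),
  is_there = c(0, 1, 1, 0, 1, 1, 1, 0),
  stringsAsFactors = TRUE
)
tab <- table(d[d$is_there == 1, -3])

# Number of groups in which each feature occurs
n_groups <- rowSums(tab > 0)

# Features found in every group (here: both "a" and "b")
in_both <- names(n_groups)[n_groups == ncol(tab)]

# Features found in exactly one of the two groups
only_a <- rownames(tab)[tab[, "a"] > 0 & tab[, "b"] == 0]
only_b <- rownames(tab)[tab[, "b"] > 0 & tab[, "a"] == 0]

# Summary counts per category
c(only_a = length(only_a), only_b = length(only_b), both = length(in_both))
```

The n_groups == ncol(tab) test carries over unchanged to more than two groups, whereas the only_a and only_b lines are specific to the two-group case.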

Upvotes: 2

rnso

Reputation: 24535

Try the following code:

with(d, tapply(is_there, list(feature, group), sum))
#       a b
#blue   1 1
#green  1 1
#red    0 1
#yellow 0 0
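
Since the result is a plain 0/1 presence matrix, the crossprod idea from the question can be applied to it directly. The following is only a sketch under that assumption; m and shared are illustrative names. The diagonal of the cross-product counts the features in each group, and each off-diagonal cell counts the features shared by a pair of groups, which may give a useful overview with around 10 groups.

```r
d <- data.frame(
  feature  = c("red", "blue", "green", "yellow", "red", "blue", "green", "yellow"),
  group    = rep(c("a", "b"), each = 4),
  is_there = c(0, 1, 1, 0, 1, 1, 1, 0)
)

# Feature-by-group matrix, as in the tapply call above
m <- with(d, tapply(is_there, list(feature, group), sum))
m[is.na(m)] <- 0      # feature/group combinations absent from the data
m <- (m > 0) * 1      # force strict 0/1 presence values

# Group-by-group matrix: t(m) %*% m
shared <- crossprod(m)
shared
#   a b
# a 2 2
# b 2 3
```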

Upvotes: 1

akrun

Reputation: 887028

 tbl <- table(d$feature[!!d$is_there], d$group[!!d$is_there])
 rowSums(tbl)
 #blue  green    red yellow 
 #  2      2      1      0 

 tbl

 #       a b
 #blue   1 1
 #green  1 1
 #red    0 1
 #yellow 0 0

If you wanted to have the groupings like below:

  d1 <- as.data.frame(matrix(rep(c("none", "only", "both")[rowSums(tbl)+1],
           each=2), ncol=2, byrow=TRUE, dimnames=dimnames(tbl)),
                                          stringsAsFactors=FALSE)

  d1[!tbl & rowSums(tbl)==1]  <- ""
  d1
 #        a    b
 #blue   both both
 #green  both both
 #red         only
 #yellow none none

Upvotes: 1

Joris Meys

Reputation: 108523

Take the following data frame:

myd <- data.frame(
  feature=c("red","blue","green","yellow","red","blue","green","yellow"),
  group=c(rep("a",4),rep("b",4)),
  is_there=c(0,1,1,0,1,0,1,0))

To get a factor telling you where everything is, you can try this code:

require(reshape2)

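res <- acast(myd, feature ~ group, fun.aggregate=sum, value.var="is_there")
# rowSums(res) counts the groups holding each feature; the diff term is b - a per feature
where <- factor(
  rowSums(res) - 2*diff(t(res))[1, ],
  levels=c(-1,0,2,3),
  labels=c("group2","nowhere","both","group1")
  )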

This gives:

> res
       a b
blue   1 0
green  1 1
red    0 1
yellow 0 0
> where
   blue   green     red  yellow 
 group1    both  group2 nowhere 
Levels: group2 nowhere both group1

Extracting those that are present everywhere is trivial from here.
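
As a rough sketch of that last step, the snippet below rebuilds the same matrix with tapply (a base R stand-in for acast, so that nothing beyond base R is needed) and then pulls out the names and counts. The encoding is the one used above: 3 for group1 only, 2 for both, -1 for group2 only, 0 for nowhere. The name in_both is illustrative.

```r
myd <- data.frame(
  feature  = c("red", "blue", "green", "yellow", "red", "blue", "green", "yellow"),
  group    = rep(c("a", "b"), each = 4),
  is_there = c(0, 1, 1, 0, 1, 0, 1, 0)
)

# Feature-by-group matrix of 0/1 values
res <- with(myd, tapply(is_there, list(feature, group), sum))

# Number of groups per feature, shifted by twice the difference b - a
where <- factor(
  rowSums(res) - 2 * (res[, "b"] - res[, "a"]),
  levels = c(-1, 0, 2, 3),
  labels = c("group2", "nowhere", "both", "group1")
)

# Names of the features present in both groups
in_both <- names(where)[where == "both"]
in_both
# [1] "green"

# How many features fall into each category
table(where)
```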

Note that any of the other solutions that give you the matrix res are equally valid (the tapply solution will be faster).

Upvotes: 0

Benoit

Reputation: 1234

Would that do the trick?

> tapply(d$feature[d$is_there==1],d$group[d$is_there==1], table)

$a
blue  green    red yellow 
   1      1      0      0 

$b
blue  green    red yellow 
   1      1      1      0

Upvotes: 0
