Reputation: 4844
Let's say I have a data frame with three columns: the first one specifies the name of a feature (e.g. a color), the second one a group, and the third one whether the feature is present in that group (1) or missing from it (0):
> d<-data.frame(feature=c("red","blue","green","yellow","red","blue","green","yellow"), group=c(rep("a",4),rep("b",4)),is_there=c(0,1,1,0,1,1,1,0))
> d
feature group is_there
1 red a 0
2 blue a 1
3 green a 1
4 yellow a 0
5 red b 1
6 blue b 1
7 green b 1
8 yellow b 0
Now I would like a summary of how many features are only in group a, how many are only in group b, and how many are present in both groups. Additionally, I need to extract the names of the features present in both groups. How can I do that? I imagine that a function like crossprod might help, but I cannot figure it out.
The output would be something like:
feature
red 1
blue 2
green 2
yellow 0
or:
feature a b
red 0 1
blue 1 1
green 1 1
yellow 0 0
Anyway, I need a better overview of a fairly big data file (the original has hundreds of features in about 10 groups).
Upvotes: 1
Views: 79
Reputation: 99331
It sounds like a table is what you want. First we subset the rows such that the is_there column equals 1 and remove the third column. Then we call table on that subset.
> ( tab <- table(d[d$is_there == 1, -3]) )
# group
# feature a b
# blue 1 1
# green 1 1
# red 0 1
# yellow 0 0
A table is a matrix-like object. We can operate on it in much the same way we operate on a matrix.
Looking at group a:
> tab[,"a"] ## vector of group "a"
# blue green red yellow
# 1 1 0 0
> tab[,"a"][ tab[,"a"] > 0 ] ## present in group "a"
# blue green
# 1 1
> names(tab[,"a"][ tab[,"a"] > 0 ]) ## "feature" present in group "a"
# [1] "blue" "green"
And the same for group b.
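Since the question also asks for the features present in both groups, the per-group lookups above can be combined with set operations. This is only a minimal sketch, assuming feature and group are factors (the default before R 4.0, made explicit here with stringsAsFactors = TRUE); the names in_a, in_b, both, only_a and only_b are illustrative and not part of the original answer.

```r
# Same data as in the question; factors keep the all-zero "yellow" row in the table
d <- data.frame(
  feature  = c("red", "blue", "green", "yellow", "red", "blue", "green", "yellow"),
  group    = c(rep("a", 4), rep("b", 4)),
  is_there = c(0, 1, 1, 0, 1, 1, 1, 0),
  stringsAsFactors = TRUE
)

# Contingency table of the rows where the feature is present
tab <- table(d[d$is_there == 1, -3])

# Feature names present in each group
in_a <- names(which(tab[, "a"] > 0))
in_b <- names(which(tab[, "b"] > 0))

# Set operations give the three categories asked for
both   <- intersect(in_a, in_b)  # present in both groups
only_a <- setdiff(in_a, in_b)    # present only in group a
only_b <- setdiff(in_b, in_a)    # present only in group b
```

length(both), length(only_a) and length(only_b) then give the counts for the summary.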
Upvotes: 2
Reputation: 24535
Try the following code:
with(d, tapply(is_there, list(feature, group), sum))
# a b
#blue 1 1
#green 1 1
#red 0 1
#yellow 0 0
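To get from this matrix to the "only in a / only in b / in both" counts, one option is to cross-tabulate the two presence columns. A rough sketch, assuming only the two groups a and b from the question; m and summary_tab are illustrative names:

```r
# Same data as in the question
d <- data.frame(
  feature  = c("red", "blue", "green", "yellow", "red", "blue", "green", "yellow"),
  group    = c(rep("a", 4), rep("b", 4)),
  is_there = c(0, 1, 1, 0, 1, 1, 1, 0),
  stringsAsFactors = TRUE
)

# feature x group matrix of presence counts
m <- with(d, tapply(is_there, list(feature, group), sum))

# 2 x 2 summary: rows = present in a?, columns = present in b?
# Explicit factor levels keep both FALSE and TRUE even if one never occurs
summary_tab <- table(
  in_a = factor(m[, "a"] > 0, levels = c(FALSE, TRUE)),
  in_b = factor(m[, "b"] > 0, levels = c(FALSE, TRUE))
)
summary_tab
```

Here summary_tab["TRUE", "TRUE"] is the number of features in both groups, summary_tab["TRUE", "FALSE"] the number only in a, and summary_tab["FALSE", "TRUE"] the number only in b.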
Upvotes: 1
Reputation: 887028
tbl <- table(d$feature[!!d$is_there], d$group[!!d$is_there])
rowSums(tbl)
#blue green red yellow
# 2 2 1 0
tbl
# a b
#blue 1 1
#green 1 1
#red 0 1
#yellow 0 0
If you wanted to have the groupings like below:
d1 <- as.data.frame(matrix(rep(c("none", "only", "both")[rowSums(tbl)+1],
each=2), ncol=2, byrow=TRUE, dimnames=dimnames(tbl)),
stringsAsFactors=FALSE)
d1[!tbl & rowSums(tbl)==1] <- ""
d1
#          a    b
#blue   both both
#green  both both
#red         only
#yellow none none
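With about 10 groups, as in the real data, the none/only/both labels no longer fit, but the same row sums still work as a summary. A sketch under the assumption of a third group; the data frame d3 is made up for illustration and is not from the question:

```r
# Hypothetical data with three groups (not from the original post)
d3 <- data.frame(
  feature  = rep(c("red", "blue", "green", "yellow"), times = 3),
  group    = rep(c("a", "b", "c"), each = 4),
  is_there = c(0, 1, 1, 0,   # group a
               1, 1, 1, 0,   # group b
               0, 1, 0, 0),  # group c
  stringsAsFactors = TRUE
)

# Logical feature x group matrix: TRUE where the feature is present
present <- with(d3, tapply(is_there, list(feature, group), sum)) > 0

# In how many groups does each feature occur?
n_groups <- rowSums(present)

# Distribution of those counts across all features
table(n_groups)

# Features present in every group
in_all <- names(n_groups)[n_groups == ncol(present)]
```

This assumes every feature/group combination appears in the data; a missing combination would give NA in the tapply result and would need handling first.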
Upvotes: 1
Reputation: 108523
Take the following data frame:
myd <- data.frame(
feature=c("red","blue","green","yellow","red","blue","green","yellow"),
group=c(rep("a",4),rep("b",4)),
is_there=c(0,1,1,0,1,0,1,0))
To get a factor telling you where everything is, you can try this code:
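library(reshape2)
# Cast to a feature x group matrix of presence counts
res <- acast(myd, feature ~ group, fun.aggregate=sum, value.var="is_there")
# Per feature, (a + b) - 2*(b - a) is 3 (only a), 2 (both), 0 (nowhere) or -1 (only b).
# rowSums gives the per-feature totals; drop() turns the 1-row diff into a named vector.
where <- factor(
  rowSums(res) - 2*drop(diff(t(res))),
  levels=c(-1,0,2,3),
  labels=c("group2","nowhere","both","group1")
)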
This gives:
> res
a b
blue 1 0
green 1 1
red 0 1
yellow 0 0
> where
blue green red yellow
group1 both group2 nowhere
Levels: group2 nowhere both group1
Extracting those that are present everywhere is trivial from here.
Note that any of the other solutions giving you the matrix res are equally valid (the tapply solution will be faster).
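For completeness, the extraction step might look like the sketch below. It builds res with base R's tapply instead of acast so that it runs without reshape2 (a swap on my part, not the code above), and it computes the same score column by column; score and in_both are illustrative names.

```r
myd <- data.frame(
  feature  = c("red", "blue", "green", "yellow", "red", "blue", "green", "yellow"),
  group    = c(rep("a", 4), rep("b", 4)),
  is_there = c(0, 1, 1, 0, 1, 0, 1, 0)
)

# Base R stand-in for the acast() call: feature x group matrix of counts
res <- with(myd, tapply(is_there, list(feature, group), sum))

# Same encoding as above: 3 = only a, 2 = both, 0 = nowhere, -1 = only b
score <- rowSums(res) - 2 * (res[, "b"] - res[, "a"])
where <- factor(
  score,
  levels = c(-1, 0, 2, 3),
  labels = c("group2", "nowhere", "both", "group1")
)

# Names of the features present in both groups
in_both <- names(where)[where == "both"]

# Count per category
table(where)
```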
Upvotes: 0
Reputation: 1234
Would this do the trick?
> tapply(d$feature[d$is_there==1],d$group[d$is_there==1], table)
$a
blue green red yellow
1 1 0 0
$b
blue green red yellow
1 1 1 0
Upvotes: 0