user1165199

Reputation: 6649

Count number of times a combination of events appears in dataframe columns (extension)

This is an extension of the question asked in "Count number of times combination of events occurs in dataframe columns". I will restate the whole question so it is all here:

I have a data frame and I want to count the number of times each combination of events in two columns occurs (in either order), with a zero if a combination doesn't appear.

For example, say I have

df <- data.frame('x' = c('a', 'b', 'c', 'c', 'c'), 
                 'y' = c('c', 'c', 'a', 'a', 'b'))

So

x y  
a c  
b c  
c a  
c a  
c b

a and b do not occur together, a and c occur together 3 times (rows 1, 3 and 4), and b and c twice (rows 2 and 5), so I would want to return

x-y num  
a-b 0  
a-c 3  
b-c 2  

I hope this makes sense. Thanks in advance.

Upvotes: 0

Views: 2032

Answers (3)

alexwhan

Reputation: 16026

An alternative, because I was a bit bored. Perhaps a bit more generalised? But probably still uglier than it could be...

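library(plyr)  # provides ddply() and .()

# Ordered counts for every (x, y) pair, including pairs that never occur
df2 <- as.data.frame(table(df), stringsAsFactors = FALSE)
# Drop self-pairs such as a-a
df2 <- df2[df2$x != df2$y, ]
# Order-independent label: sort the two values, then join them with '-'
df2$com <- apply(df2[, 1:2], 1, function(x) paste(sort(x), collapse = '-'))
# Sum the ordered counts within each unordered pair
ddply(df2, .(com), summarise, num = sum(Freq))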
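
If plyr is not available, the same idea can probably be written in base R. This is only a sketch: the names counts, pair and result are made up for illustration, and it assumes every event shows up at least once in each column, as it does in the question's data (otherwise table(df) will not contain all pairs).

```r
df <- data.frame(x = c('a', 'b', 'c', 'c', 'c'),
                 y = c('c', 'c', 'a', 'a', 'b'))

# Ordered counts for every (x, y) pair, zeros included
counts <- as.data.frame(table(df), stringsAsFactors = FALSE)
# Remove self-pairs such as a-a
counts <- counts[counts$x != counts$y, ]
# Order-independent label: (a, c) and (c, a) both become "a-c"
counts$pair <- paste(pmin(counts$x, counts$y), pmax(counts$x, counts$y), sep = '-')
# Sum the ordered counts within each unordered pair
result <- aggregate(Freq ~ pair, data = counts, FUN = sum)
names(result) <- c('pair', 'num')
result
```

pmin() and pmax() pick the alphabetically smaller and larger value of each row, which avoids the row-wise apply().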

Upvotes: 0

seandavi

Reputation: 2968

This should do it:

res = table(df)

To convert to data frame:

resdf = as.data.frame(res)

The resdf data.frame looks like:

  x y Freq
1 a a    0
2 b a    0
3 c a    2
4 a b    0
5 b b    0
6 c b    1
7 a c    1
8 b c    1
9 c c    0

Note that this answer takes order into account, so a-c and c-a are counted separately. If the order of the columns is unimportant, sorting each row of the original data.frame before tabulating removes the effect of ordering (a-c is then treated the same as c-a).

df1 = as.data.frame(t(apply(df,1,sort)))
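
On its own, tabulating the sorted data frame would leave out pairs that never occur (a-b here), because after sorting the first column no longer contains every event. One possible way to complete it, sketched under the assumption that both columns should share the full set of events as factor levels (lev and resdf1 are illustrative names):

```r
df <- data.frame(x = c('a', 'b', 'c', 'c', 'c'),
                 y = c('c', 'c', 'a', 'a', 'b'))

# Sort each row so that (c, a) and (a, c) both become (a, c)
df1 <- as.data.frame(t(apply(df, 1, sort)))
names(df1) <- c('x', 'y')

# Give both columns the same full set of levels so unseen pairs appear as 0
lev <- sort(unique(c(as.character(df$x), as.character(df$y))))
df1$x <- factor(df1$x, levels = lev)
df1$y <- factor(df1$y, levels = lev)

resdf1 <- as.data.frame(table(df1))
# After sorting, only pairs with x before y are meaningful
resdf1 <- resdf1[as.character(resdf1$x) < as.character(resdf1$y), ]
resdf1
```

The final filter drops the self-pairs and the reversed pairs, which are always zero once the rows have been sorted.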

Upvotes: 4

Rcoster

Reputation: 3210

As said, you can do this with factor() and expand.grid() (or any other way of generating all possible combinations):

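# All ordered pairs of the three events
all.possible <- expand.grid(c('a','b','c'), c('a','b','c'))
# Drop self-pairs such as a-a
all.possible <- all.possible[all.possible[, 1] != all.possible[, 2], ]
# Collapse each pair to a sorted label so that a-c and c-a become the same level
all.possible <- unique(apply(all.possible, 1, function(x) paste(sort(x), collapse='-')))

df <- data.frame('x' = c('a', 'b', 'c', 'c', 'c'), 
                 'y' = c('c', 'c', 'a', 'a', 'b'))
# Label each row the same way and count against all possible levels (zeros included)
table(factor(apply(df , 1, function(x) paste(sort(x), collapse='-')), levels=all.possible))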
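
As one such "other way", the levels could be derived from the data with combn() instead of being typed out by hand. The following is a rough sketch; events, pairs, counts and result are names chosen here for illustration, and it assumes the set of events is whatever appears in either column.

```r
df <- data.frame(x = c('a', 'b', 'c', 'c', 'c'),
                 y = c('c', 'c', 'a', 'a', 'b'))

# Every event that appears in either column
events <- sort(unique(c(as.character(df$x), as.character(df$y))))
# All unordered pairs of distinct events: "a-b", "a-c", "b-c"
all.possible <- combn(events, 2, paste, collapse = '-')

# Label each row by its sorted pair and count against the full set of levels
pairs <- apply(df, 1, function(x) paste(sort(x), collapse = '-'))
counts <- table(factor(pairs, levels = all.possible))

# Reshape into the two-column layout asked for in the question
result <- data.frame(pair = names(counts), num = as.vector(counts))
result
```

Because combn() only produces each unordered pair once, no filtering of self-pairs or duplicates is needed.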

Upvotes: 1
