roadonfire
roadonfire

Reputation: 13

R-Software: Counting occurrence combination in a column based on second column

I have a simple problem (seemingly) but have not yet able to find an appropriately quick/time & resource efficient solution. This is a problem in R-Software. My data is of format:

INPUT
col1     col2
A         q
C         w
B         e
A         r
A         t
A         y
C         q
B         w
C         e
C         r
B         t
C         y

DESIRED OUTPUT
unit1     unit2     same_col2_freq
A          B          1
A          C          3
B          A          1
B          C          2
C          A          3
C          B          2 

That is in input A has occurred in col1 with q, r, t, y occurring in col2. Now, q, r, t, y occurs for B with t so the A-B combination has count 1. B has occurred in col1 with e, w, t occurring in col2. Now, e, w, t occurs for C with w, t so the B-C combination has count 2. .... and so on for all combinations in col1.

I have done it using a for loop but it is slow. I am picking unique elements from col1 and then, all the data is iterated for each element of col1. Then I am combining the results using rbind. This is slow and resource costly.

I am looking for an efficient method. Maybe a library, function etc. exists that I am unaware of. I tried using co-occurrence matrix but the number of elements in col1 is of order of ~10,000 and it does not solve my purpose.

Any help is greatly appreciated.

Thanks!

Upvotes: 0

Views: 275

Answers (2)

akrun
akrun

Reputation: 887981

Here is a similar approach (as showed by @cogitovita), but using data.table. Convert the "data.frame" to "data.table" using setDT, then Cross Join (CJ) the unique elements of "col1", grouped by "col2". Subset the rows of the output columns that are not equal (V1!=V2), get the count (.N), grouped by the new columns (.(V1, V2)) and finally order the columns (order(V1,V2))

library(data.table)
setDT(df)[,CJ(unique(col1), unique(col1)), col2][V1!=V2, 
         .N, .(V1,V2)][order(V1,V2)]  
#   V1 V2 N
#1:  A  B 1
#2:  A  C 3
#3:  B  A 1
#4:  B  C 2
#5:  C  A 3
#6:  C  B 2

data

df <-  structure(list(col1 = c("A", "C", "B", "A", "A", "A", "C", "B", 
"C", "C", "B", "C"), col2 = c("q", "w", "e", "r", "t", "y", "q", 
"w", "e", "r", "t", "y")), .Names = c("col1", "col2"), class =
"data.frame", row.names = c(NA, -12L))

Upvotes: 0

cogitovita
cogitovita

Reputation: 1745

Use merge to join the dataframe with itself and then use aggregate to count within groups. demo:

d = data.frame(col1=c("A", "C", "B", "A", "A", "A", "C", "B", "C", "C", "B", "C"), col2=c("q", "w", "e", "r", "t", "y", "q", "w", "e", "r", "t", "y"))
dm = merge(d, d, by="col2")
dm = dm[dm[,'col1.x']!=dm[,'col1.y'],]
aggregate(col2 ~ col1.x + col1.y, data=dm, length)
#       col1.x col1.y col2
# 1      B      A    1
# 2      C      A    3
# 3      A      B    1
# 4      C      B    2
# 5      A      C    3
# 6      B      C    2

Upvotes: 1

Related Questions