Reputation: 13
I have a simple problem (seemingly) but have not yet able to find an appropriately quick/time & resource efficient solution. This is a problem in R-Software. My data is of format:
INPUT
col1 col2
A q
C w
B e
A r
A t
A y
C q
B w
C e
C r
B t
C y
DESIRED OUTPUT
unit1 unit2 same_col2_freq
A B 1
A C 3
B A 1
B C 2
C A 3
C B 2
That is in input A has occurred in col1 with q, r, t, y occurring in col2. Now, q, r, t, y occurs for B with t so the A-B combination has count 1. B has occurred in col1 with e, w, t occurring in col2. Now, e, w, t occurs for C with w, t so the B-C combination has count 2. .... and so on for all combinations in col1.
I have done it using a for loop but it is slow. I am picking unique elements from col1 and then, all the data is iterated for each element of col1. Then I am combining the results using rbind. This is slow and resource costly.
I am looking for an efficient method. Maybe a library, function etc. exists that I am unaware of. I tried using co-occurrence matrix but the number of elements in col1 is of order of ~10,000 and it does not solve my purpose.
Any help is greatly appreciated.
Thanks!
Upvotes: 0
Views: 275
Reputation: 887981
Here is a similar approach (as showed by @cogitovita), but using data.table
. Convert the "data.frame" to "data.table" using setDT
, then Cross Join (CJ
) the unique elements of "col1", grouped by "col2". Subset the rows of the output columns that are not equal (V1!=V2
), get the count (.N
), grouped by the new columns (.(V1, V2)
) and finally order the columns (order(V1,V2)
)
library(data.table)
setDT(df)[,CJ(unique(col1), unique(col1)), col2][V1!=V2,
.N, .(V1,V2)][order(V1,V2)]
# V1 V2 N
#1: A B 1
#2: A C 3
#3: B A 1
#4: B C 2
#5: C A 3
#6: C B 2
df <- structure(list(col1 = c("A", "C", "B", "A", "A", "A", "C", "B",
"C", "C", "B", "C"), col2 = c("q", "w", "e", "r", "t", "y", "q",
"w", "e", "r", "t", "y")), .Names = c("col1", "col2"), class =
"data.frame", row.names = c(NA, -12L))
Upvotes: 0
Reputation: 1745
Use merge
to join the dataframe with itself and then use aggregate
to count within groups. demo:
d = data.frame(col1=c("A", "C", "B", "A", "A", "A", "C", "B", "C", "C", "B", "C"), col2=c("q", "w", "e", "r", "t", "y", "q", "w", "e", "r", "t", "y"))
dm = merge(d, d, by="col2")
dm = dm[dm[,'col1.x']!=dm[,'col1.y'],]
aggregate(col2 ~ col1.x + col1.y, data=dm, length)
# col1.x col1.y col2
# 1 B A 1
# 2 C A 3
# 3 A B 1
# 4 C B 2
# 5 A C 3
# 6 B C 2
Upvotes: 1