Otto_Shmotto
Otto_Shmotto

Reputation: 23

Create Similarity Matrix

I have a matrix that looks like the following:

col_1 col_2 value

 A      B     2.1

 A      C     1.3

 B      C     4.6

 A      D     1.4

....

I would like to get a similarity matrix:

    A    B    C   D

A    X   2.1  1.3  1.4

B    2.1  X   4.6   ...

C    ...  ...  X    ...

D    ...  ... ...    X

So the row and the column names are A,B,C,D and the it takes the value from the third column and adds it to the matrix The problem is also that the original matrix has a length of approximately 10 000 rows.

Upvotes: 0

Views: 2319

Answers (3)

acylam
acylam

Reputation: 18701

With xtabs and mutate_at. sparse = TRUE turns the output to a sparseMatrix:

library(dplyr)

mat <- df %>%
  mutate_at(1:2, factor, levels = unique(c(levels(.$col_1), levels(.$col_2)))) %>%
  xtabs(value ~ col_1 + col_2, data=., sparse = TRUE)

mat[lower.tri(mat)] <- mat[upper.tri(mat)]

Result:

4 x 4 sparse Matrix of class "dgCMatrix"
     col_2
col_1   A   B   C   D
    A .   2.1 1.3 1.4
    B 2.1 .   4.6 .  
    C 1.3 1.4 .   .  
    D 4.6 .   .   .  

Upvotes: 2

s_baldur
s_baldur

Reputation: 33753

As Roland suggests, you could use dcast():

library(data.table)
dcast(df, col_1 ~ col_2)
##   col_1   B   C   D
## 1     A 2.1 1.3 1.4
## 2     B  NA 4.6  NA

Where:

df <- data.frame(
  col_1 = c("A", "A", "B", "A"), 
  col_2 = c("B","C", "C", "D"), 
  value = c(2.1, 1.3, 4.6, 1.4)
)

Upvotes: 2

Gabriel M
Gabriel M

Reputation: 1514

You can do it the following way. I write the code in Python since no language is specified

#I assume that your data is in a python pandas dataframe called df

df = ..load your data  
list_of_labels = [ 'A','B','C','D' ]
nb_labels = len(list_of_labels)
similarity = np.zeros( (nb_labels,nb_labels) )

for l1, l2, val in zip( df['col_1'] , df['col_2'] , df['value'] ):
    i = list_of_labels.index( l1 )
    j = list_of_labels.index( l2 )
    similarity[i][j] = val


similarity_df = pd.DataFrame(data=similarity, index=list_of_labels, columns=list_of_labels)

Upvotes: 1

Related Questions