Manasi Shah
Manasi Shah

Reputation: 437

calculate pairwise Hamming distance and retain smaller one

I have a matrix:

A = matrix( c(1, 0, 0, 0, 1, 0, 0, 0, 1), nrow=3, ncol=3, byrow = TRUE)
dimnames(A) = list(c("Taxa1", "Taxa2", "Taxa3"), c("A1", "B1", "C1"))
df <- data.frame("ID" = c("A1", "B1", "C1"), "Triplicate" = c("T1", "T1", "T1"))

I would like to calculate the Hamming distance between A1 and B1, and between A1 and C1 for one taxa and retain the minimum of the two as the designated value for A1. Then calculate the Hamming distance between B1 and A1 (which we already did) and B1 and C1 and retain the minimum of that value for B1 and so on for C1. Note samples A1, B1, C1 are part of the same Triplicate indicated by T1. I will have similar other Triplicate samples T2, T3 or T4 and I would like to group samples based on the Triplicate column value for calculating pairwise Hamming

Resulting matrix would be:

df$Hamming <- c(2, 2, 2)

As distance between A1, B1 and A1, C1 is 2 so retaining the value as 2.

PS: Simple 1 min description of Hamming distance: https://www.youtube.com/watch?v=P02mJhS9qQ4

Adding exact data I am working with: https://www.dropbox.com/s/wrlwmdipeyhbcok/Hamming.RData?dl=0

Upvotes: 3

Views: 647

Answers (2)

Sathish
Sathish

Reputation: 12703

Bit flip is identified by using xor() function, and the total number of bit flips is found by summing the results from xor() function. I did not optimize the code in hamm_dist_min().

xor(0,0)
# [1] FALSE
xor(1,1)
# [1] FALSE
xor(0,1)
# [1] TRUE
xor(1,0)
# [1] TRUE

As per OP's requirement, the hamming distance is computed for both directions. For example: AB, BA, AC, CA, BC, and CB which forms triplicates as ABC, BCA, and CAB. If you want only one direction, for example: AB, AC, and BC you can use combn() function to setup the column numbers inside the hamm_dist_min() function.

Data:

A = matrix( c(1, 0, 0, 0, 1, 0, 0, 0, 1), nrow=3, ncol=3, byrow = TRUE)
dimnames(A) = list(c("Taxa1", "Taxa2", "Taxa3"), c("A1", "B1", "C1"))
df <- data.frame("ID" = c("A1", "B1", "C1"), "Triplicate" = c("T1", "T1", "T1"))

Hamming Distance

# minimum of hamming distance
hamm_dist_min <- function(data)
{
  # setup combinations of column numbers
  n_col <- ncol(data)
  x <- expand.grid(seq_len(n_col), seq_len(n_col))
  x <- x[ x[, 1] != x[, 2], ]
  x <- x[order(x[, 1]), ]
  x <- split(x, cut(x[, 1], breaks = c(0, seq_len(n_col)), labels = colnames(data) ))

  # minimum of hamming distance
  h_d <- unlist(lapply(x, function(y){
    min( colSums(apply(y, 1, function(z) xor(data[, z[1]], data[, z[2]]))))
  }))
  return(h_d)
}

hamm_dist_min(data = A)
# A1 B1 C1 
# 2  2  2

df$Hamming <- hamm_dist_min(data = A)
df
#   ID Triplicate Hamming
# 1 A1         T1       2
# 2 B1         T1       2
# 3 C1         T1       2

Youtube example:

df1 <- matrix( c(0,1,1,1,1,1,0,0), ncol = 2, byrow = FALSE)
colnames(df1) <- LETTERS[1:2]
hamm_dist_min(data = df1)
# A B 
# 3 3 

EDIT: Based on the new dataset added in the question.

Note: If there is only one column for a sample type, then 0 is taken as the hamming distance, because we need at least 2 columns to compute hamming distance. Look at T71 in the triplicate column of df. Instead of the value 0, you can return NA which means Not Available

load("Hamming.RData")

# setup unique colnames pattern
col_list <- unique(unlist(lapply( colnames(A), function(x){
  substr(x = x, start = 1, stop = nchar(x) - 1)
} )
))

# get hamming distance
my_results <- lapply( col_list, function(x){
  cols_x <- grep(x, colnames(A) )
  if(length(cols_x) == 1 ){  # return 0 for one column
    return( setNames( object = rep( 0, length(cols_x)), nm = colnames(A)[cols_x]))
  } else{ # return minimum of hamming distance
    return(hamm_dist_min(data = A[, cols_x]))
  }
})

# get triplicate id
triplicate <- paste0( "T", rep(seq_along(my_results), 
                               lengths(my_results)))

# final data
my_results <- unlist(my_results)
df <- data.frame( SampleID = names( my_results ),
                  Hamming = my_results,
                  Triplicate = triplicate,
                  stringsAsFactors = FALSE )

head(df)
#                        SampleID Hamming Triplicate
# Affy22_MDA_1       Affy22_MDA_1       2         T1
# Affy22_MDA_2       Affy22_MDA_2       2         T1
# Affy22_MDA_3       Affy22_MDA_3       3         T1
# GutRef001_MDA_1 GutRef001_MDA_1       4         T2
# GutRef001_MDA_2 GutRef001_MDA_2       4         T2
# GutRef001_MDA_3 GutRef001_MDA_3       6         T2

Upvotes: 2

eipi10
eipi10

Reputation: 93771

Below is a way to calculate the Hamming distance for each pair of columns. I'm not sure how you want to deal with cases where the distances between pairs of columns are the same. Here I just select the first column alphabetically among all columns with same Hamming distance from a given reference column.

library(e1071)  # For the hamming.distance function
library(tidyverse)

# Get Hamming distance for all pairs of columns in matrix A. 
hd = combn(colnames(A), 2, simplify=FALSE) %>% 
  map_df(function(col) data.frame(col1=col[1], col2=col[2], 
                                  Hamming=hamming.distance(A[,col[1]], A[, col[2]]))) %>% 
  # For a given column, keep only the shortest Hamming distance
  group_by(col1) %>% 
  arrange(Hamming, col2) %>% 
  slice(1) %>%
  ungroup %>% 
  # Add a column to mark which Hamming distance pair we kept
  mutate(pair = paste0(col1, "_", col2))

hd
   col1  col2 Hamming  pair
1    A1    B1       2 A1_B1
2    B1    C1       2 B1_C1

Now join the Hamming distance values to the appropriate ID. First, we stack col1 and col2 from hd and remove repeated ID values, then we join the result to df.

df = df %>% left_join(
  bind_rows(hd %>% select(col1, Hamming, pair),
            hd %>% select(col1=col2, Hamming, pair)) %>% 
    filter(!duplicated(col1)),
  by=c("ID"="col1")
)
  ID Triplicate Hamming  pair
1 A1         T1       2 A1_B1
2 B1         T1       2 B1_C1
3 C1         T1       2 B1_C1

Upvotes: 1

Related Questions