Reputation: 437
I have a matrix:
A = matrix( c(1, 0, 0, 0, 1, 0, 0, 0, 1), nrow=3, ncol=3, byrow = TRUE)
dimnames(A) = list(c("Taxa1", "Taxa2", "Taxa3"), c("A1", "B1", "C1"))
df <- data.frame("ID" = c("A1", "B1", "C1"), "Triplicate" = c("T1", "T1", "T1"))
I would like to calculate the Hamming distance between A1 and B1, and between A1 and C1 for one taxa and retain the minimum of the two as the designated value for A1. Then calculate the Hamming distance between B1 and A1 (which we already did) and B1 and C1 and retain the minimum of that value for B1 and so on for C1. Note samples A1, B1, C1 are part of the same Triplicate indicated by T1. I will have similar other Triplicate
samples T2, T3 or T4
and I would like to group samples based on the Triplicate
column value for calculating pairwise Hamming
Resulting matrix would be:
df$Hamming <- c(2, 2, 2)
As distance between A1, B1 and A1, C1 is 2 so retaining the value as 2.
PS: Simple 1 min description of Hamming distance: https://www.youtube.com/watch?v=P02mJhS9qQ4
Adding exact data I am working with: https://www.dropbox.com/s/wrlwmdipeyhbcok/Hamming.RData?dl=0
Upvotes: 3
Views: 647
Reputation: 12703
Bit flip is identified by using xor()
function, and the total number of bit flips is found by summing the results from xor()
function. I did not optimize the code in hamm_dist_min()
.
xor(0,0)
# [1] FALSE
xor(1,1)
# [1] FALSE
xor(0,1)
# [1] TRUE
xor(1,0)
# [1] TRUE
As per OP's requirement, the hamming distance is computed for both directions. For example: AB, BA, AC, CA, BC, and CB which forms triplicates as ABC, BCA, and CAB. If you want only one direction, for example: AB, AC, and BC you can use combn()
function to setup the column numbers inside the hamm_dist_min()
function.
Data:
A = matrix( c(1, 0, 0, 0, 1, 0, 0, 0, 1), nrow=3, ncol=3, byrow = TRUE)
dimnames(A) = list(c("Taxa1", "Taxa2", "Taxa3"), c("A1", "B1", "C1"))
df <- data.frame("ID" = c("A1", "B1", "C1"), "Triplicate" = c("T1", "T1", "T1"))
Hamming Distance
# minimum of hamming distance
hamm_dist_min <- function(data)
{
# setup combinations of column numbers
n_col <- ncol(data)
x <- expand.grid(seq_len(n_col), seq_len(n_col))
x <- x[ x[, 1] != x[, 2], ]
x <- x[order(x[, 1]), ]
x <- split(x, cut(x[, 1], breaks = c(0, seq_len(n_col)), labels = colnames(data) ))
# minimum of hamming distance
h_d <- unlist(lapply(x, function(y){
min( colSums(apply(y, 1, function(z) xor(data[, z[1]], data[, z[2]]))))
}))
return(h_d)
}
hamm_dist_min(data = A)
# A1 B1 C1
# 2 2 2
df$Hamming <- hamm_dist_min(data = A)
df
# ID Triplicate Hamming
# 1 A1 T1 2
# 2 B1 T1 2
# 3 C1 T1 2
df1 <- matrix( c(0,1,1,1,1,1,0,0), ncol = 2, byrow = FALSE)
colnames(df1) <- LETTERS[1:2]
hamm_dist_min(data = df1)
# A B
# 3 3
EDIT: Based on the new dataset added in the question.
Note: If there is only one column for a sample type, then 0 is taken as the hamming distance, because we need at least 2 columns to compute hamming distance. Look at T71 in the triplicate column of df
. Instead of the value 0, you can return NA
which means Not Available
load("Hamming.RData")
# setup unique colnames pattern
col_list <- unique(unlist(lapply( colnames(A), function(x){
substr(x = x, start = 1, stop = nchar(x) - 1)
} )
))
# get hamming distance
my_results <- lapply( col_list, function(x){
cols_x <- grep(x, colnames(A) )
if(length(cols_x) == 1 ){ # return 0 for one column
return( setNames( object = rep( 0, length(cols_x)), nm = colnames(A)[cols_x]))
} else{ # return minimum of hamming distance
return(hamm_dist_min(data = A[, cols_x]))
}
})
# get triplicate id
triplicate <- paste0( "T", rep(seq_along(my_results),
lengths(my_results)))
# final data
my_results <- unlist(my_results)
df <- data.frame( SampleID = names( my_results ),
Hamming = my_results,
Triplicate = triplicate,
stringsAsFactors = FALSE )
head(df)
# SampleID Hamming Triplicate
# Affy22_MDA_1 Affy22_MDA_1 2 T1
# Affy22_MDA_2 Affy22_MDA_2 2 T1
# Affy22_MDA_3 Affy22_MDA_3 3 T1
# GutRef001_MDA_1 GutRef001_MDA_1 4 T2
# GutRef001_MDA_2 GutRef001_MDA_2 4 T2
# GutRef001_MDA_3 GutRef001_MDA_3 6 T2
Upvotes: 2
Reputation: 93771
Below is a way to calculate the Hamming distance for each pair of columns. I'm not sure how you want to deal with cases where the distances between pairs of columns are the same. Here I just select the first column alphabetically among all columns with same Hamming distance from a given reference column.
library(e1071) # For the hamming.distance function
library(tidyverse)
# Get Hamming distance for all pairs of columns in matrix A.
hd = combn(colnames(A), 2, simplify=FALSE) %>%
map_df(function(col) data.frame(col1=col[1], col2=col[2],
Hamming=hamming.distance(A[,col[1]], A[, col[2]]))) %>%
# For a given column, keep only the shortest Hamming distance
group_by(col1) %>%
arrange(Hamming, col2) %>%
slice(1) %>%
ungroup %>%
# Add a column to mark which Hamming distance pair we kept
mutate(pair = paste0(col1, "_", col2))
hd
col1 col2 Hamming pair 1 A1 B1 2 A1_B1 2 B1 C1 2 B1_C1
Now join the Hamming distance values to the appropriate ID
. First, we stack col1
and col2
from hd
and remove repeated ID values, then we join the result to df
.
df = df %>% left_join(
bind_rows(hd %>% select(col1, Hamming, pair),
hd %>% select(col1=col2, Hamming, pair)) %>%
filter(!duplicated(col1)),
by=c("ID"="col1")
)
ID Triplicate Hamming pair 1 A1 T1 2 A1_B1 2 B1 T1 2 B1_C1 3 C1 T1 2 B1_C1
Upvotes: 1