Reputation: 669
I have a matrix with multiple individuals in rows and multiple nucleotides (values) in columns. It looks like this:
[,1][,2][,3][,4] ...
ind1 a c a a
ind2 a c t t
ind3 a g g c
ind4 a g g g
.
.
.
Now I would like to ignore all columns where only one value occurs (as in the example above the first column) and convert every column with two, three and four (no more than 4 is possible!) different nucleotides (values) into binary format. In the end it should look like this:
[,1] [,2] [,3] ...
ind1 10 100 1000
ind2 10 010 0100
ind3 01 001 0010
ind4 01 001 0001
.
.
.
For me it is only important to get the same binary code for if there are two, three or four different values. I was already calculating how many different values in each column occur, but I am not sure how to change the values to binary format:
df <- apply(df, 2, function(x) length(unique(x)))
Can someone help me?
Upvotes: 2
Views: 376
Reputation: 332
Here is my attempt:
r1 <- c("a","c","a","a")
r2 <- c("a","c","t","t")
r3 <- c("a","g","g","c")
r4 <- c("a","g","g","g")
n.mat <- rbind(r1,r2,r3,r4)
number_to_nucleotide_binary <- function(x,len) {
out <- rep("0",len)
out[x] <- "1"
return(paste(out,collapse = ""))
}
nuc_to_binary <- function(x) {
len <- length(unique(x))
char <- sort(unique(x))
out <- x
if(len != 1) {
pos <- match(x,char)
out <- sapply(X = pos,FUN = function(x) {number_to_nucleotide_binary(x = x,len = len)})
}
return(out)
}
apply(X = n.mat,FUN = nuc_to_binary,MARGIN = 2)
Input:
[,1] [,2] [,3] [,4]
r1 "a" "c" "a" "a"
r2 "a" "c" "t" "t"
r3 "a" "g" "g" "c"
r4 "a" "g" "g" "g"
Output:
[,1] [,2] [,3] [,4]
[1,] "a" "10" "100" "1000"
[2,] "a" "10" "001" "0001"
[3,] "a" "01" "010" "0100"
[4,] "a" "01" "010" "0010"
Upvotes: 1
Reputation: 30474
Here is something else to try. A custom function will take each column through apply
. First, you can create a vector of numeric values corresponding to unique characters in the column (unique
is used, as factor
will otherwise alphabetize the order). A string of zeroes the length of the maximum number will be made, and the character position corresponding to each value will then be substituted with "1".
my_fun <- function(x) {
vec <- as.numeric(factor(x, levels = unique(x)))
vec_max <- max(na.omit(vec))
lapply(vec,
function(y) ifelse(!is.na(y),
sub(paste0("(.{", y - 1, "})."),
"\\11",
paste0(rep("0", vec_max), collapse = "")),
NA))
}
m[] <- matrix(unlist(apply(m, 2, my_fun)))
Output
[,1] [,2] [,3] [,4]
ind1 "1" "10" "100" "1000"
ind2 "1" "10" "010" "0100"
ind3 "1" "01" "001" "0010"
ind4 "1" "01" "001" "0001"
Data
m <- structure(c("a", "a", "a", "a", "c", "c", "g", "g", "a", "t",
"g", "g", "a", "t", "c", "g"), .Dim = c(4L, 4L), .Dimnames = list(
c("ind1", "ind2", "ind3", "ind4"), NULL))
Upvotes: 2