Reputation: 5719
I have this data matrix called mymat
. It has got .GT
columns for samples 00860
and 00861
. I want to expand this matrix with new .AD
column. The corresponding .AD
columns for each sample will have values 50,0
if .GT
is 0/0
, 25/25
if .GT
is 0/1
and 0,50
if .GT
is 1/1
. I also want to add another column called .DP
next to each column which will have 50
across the column and get the result
. How can I do this kind of conditional expansion of matrix in R?
mymat <- structure(c("0/1", "1/1", "0/0", "0/0"), .Dim = c(2L, 2L), .Dimnames = list(
c("chr1:1163804", "chr1:1888193"
), c("00860.GT", "00861.GT")))
result:
00860.GT 00860.AD 00860.DP 00861.GT 00861.AD 00861.DP
chr1:1163804 0/1 25/25 50 0/0 50,0 50
chr1:1888193 1/1 0/50 50 0/0 50,0 50
Upvotes: 0
Views: 477
Reputation: 7435
There may be a better way, but one way to do this using dplyr
is:
library(dplyr)
set.AD <- function(x) { ## 1.
if (x=="0/0") {
return("50/0")
} else if (x=="0/1") {
return("25/25")
} else {
return("0/50")
}
}
mymat <- data.frame(ID=seq_len(nrow(mymat)),mymat) ## 2.
rnames <- rownames(mymat)
out = mymat %>% group_by(ID) ## 3.
%>% mutate(`X00860.AD`=set.AD(`X00860.GT`), `X00860.DP`=50,
`X00861.AD`=set.AD(`X00861.GT`), `X00861.DP`=50)
out <- data.frame(out[,-1]) ## 4.
rownames(out) <- rnames
Notes:
AD
column from the GT
column based on your logic.group_by
. Also preserve the row names.mutate
to create the AD
and DP
columns for both the X00860.GT
and X00861.GT
columns. Note that the conversion to data frame prepends the column names with X
because R does not like column names that start with a number. See this SO answer for an explanation.At this point what is returned is a tibble
. Therefore,
ID
column, convert to a data frame, and add back the row names.The result on your data is:
print(out)
## X00860.GT X00861.GT X00860.AD X00860.DP X00861.AD X00861.DP
##chr1:1163804 0/1 0/0 25/25 50 50/0 50
##chr1:1888193 1/1 0/0 0/50 50 50/0 50
To reorder the columns to match your output, you can simply:
out <- out[,c(1,3,4,2,5,6)]
## X00860.GT X00860.AD X00860.DP X00861.GT X00861.AD X00861.DP
##chr1:1163804 0/1 25/25 50 0/0 50/0 50
##chr1:1888193 1/1 0/50 50 0/0 50/0 50
Obviously, this approach can only handle your two columns, but can handle any number of rows.
Edited to handle any number of columns (samples)
Notes are given as comments
# keep column and row names of original mymat to use later
cnames <- colnames(mymat)
rnames <- rownames(mymat)
# since DP columns are always 50, we just create a data frame filled with 50
# to bind to the result as additional columns
dp <- data.frame(matrix(rep(50,ncol(mymat)*nrow(mymat)), nrow=nrow(mymat), ncol=ncol(mymat)))
# set the column name to that of mymat
colnames(dp) <- cnames
# convert to data frame and augment with ID as before
mymat <- data.frame(ID=seq_len(nrow(mymat)),mymat)
# the difference here is that we use mutate_each to apply set.AD to each
# (and all) column of the input. This is done in-place. We then bind the
# original mymat and dp as columns to this result
out <- mymat %>% group_by(ID)
%>% mutate_each(funs(set.AD))
%>% ungroup() %>% select(-ID)
%>% bind_cols(mymat[,-1],.) %>% bind_cols(dp)
# At this point, we have the original mymat columns followed by the
# AD columns followed by the DP columns. The following uses a matrix
# transpose trick to resort the columns to what you want
col.order <- as.vector(t(matrix(seq_len(ncol(out)), nrow=ncol(mymat)-1, ncol=3)))
out <- data.frame(out[,col.order])
# finally, use gsub to change the column names for the AD and DP columns,
# get rid of the 'X' in the column names, and add back the row names
colnames(out) <- gsub("X", "", gsub("GT.1", "AD", gsub("GT.2", "DP", colnames(out))))
rownames(out) <- rnames
print(out)
## 00860.GT 00860.AD 00860.DP 00861.GT 00861.AD 00861.DP
##chr1:1163804 0/1 25/25 50 0/0 50/0 50
##chr1:1888193 1/1 0/50 50 0/0 50/0 50
Hope this helps.
Upvotes: 1
Reputation: 1495
Here's a data.table solution, with each line commented. It is written to handle any number of columns in your mymat
object. I will explain briefly:
1) First, we convert to a data.table format where we can handle any number of columns, assuming it will be in a similar format.
2) We find all of the ".GT" columns and extract the number before the ".GT".
3) We create ".DP" columns for each ".GT" column found.
4) We develop a "GT" to "AD" mapping by creating a vector of the "to" part of the mapping. The "from" part is stored as names in the vector.
5) Use the .SDcols feature in the data.table to apply the "GT" to "AD" mapping, and create the "AD" columns.
# Your matrix
mymat <- structure(c("0/1", "1/1", "0/0", "0/0"), .Dim = c(2L, 2L),
.Dimnames = list(c("chr1:1163804", "chr1:1888193"),
c("00860.GT", "00861.GT")))
# Using a data table approach
library(data.table)
# Casting to data table - row.names will be converted to a column called 'rn'.
mymat = as.data.table(mymat, keep.rownames = T)
# Find "GT" columns
GTcols = grep("GT", colnames(mymat))
# Get number before ".GT"
selectedCols = gsub(".GT", "", colnames(mymat)[GTcols])
selectedCols
[1] "00860" "00861"
# Create ".DP" columns
mymat[, paste0(selectedCols, ".DP") := 50, with = F]
mymat
rn 00860.GT 00861.GT 00860.DP 00861.DP
1: chr1:1163804 0/1 0/0 50 50
2: chr1:1888193 1/1 0/0 50 50
# Create "GT" to "AD" mapping
GTToADMapping = c("50,0", "25/25", "0/50")
names(GTToADMapping) = c("0/0", "0/1", "1/1")
GTToADMapping
0/0 0/1 1/1
"50,0" "25/25" "0/50"
# This function will return the "AD" mapping given the values of "GT"
mapGTToAD <- function(x){
return (GTToADMapping[x])
}
# Here, we create the AD columns using the GT mapping
mymat[, (paste0(selectedCols, ".AD")) := lapply(.SD, mapGTToAD), with = F,
.SDcols = colnames(mymat)[GTcols]]
rn 00860.GT 00861.GT 00860.DP 00861.DP 00860.AD 00861.AD
1: chr1:1163804 0/1 0/0 50 50 25/25 50,0
2: chr1:1888193 1/1 0/0 50 50 0/50 50,0
# We can sort the data now as you have it
colOrder = as.vector(rbind(paste0(selectedCols, ".GT"),
paste0(selectedCols, ".AD"),
paste0(selectedCols, ".DP")))
mymat = mymat[, c("rn", colOrder), with = F]
mymat
rn 00860.GT 00860.AD 00860.DP 00861.GT 00861.AD 00861.DP
1: chr1:1163804 0/1 25/25 50 0/0 50,0 50
2: chr1:1888193 1/1 0/50 50 0/0 50,0 50
# Put it back in the format you had
mymat2 = as.matrix(mymat[,-1, with = F])
rownames(mymat2) = mymat$rn
mymat2
00860.GT 00860.AD 00860.DP 00861.GT 00861.AD 00861.DP
chr1:1163804 "0/1" "25/25" "50" "0/0" "50,0" "50"
chr1:1888193 "1/1" "0/50" "50" "0/0" "50,0" "50"
Upvotes: 1