Achal Neupane
Achal Neupane

Reputation: 5719

How to expand data matrix for corresponding column names

I have this data matrix called mymat. It has got .GT columns for samples 00860 and 00861 . I want to expand this matrix with new .AD column. The corresponding .AD columns for each sample will have values 50,0 if .GT is 0/0, 25/25 if .GT is 0/1 and 0,50 if .GT is 1/1. I also want to add another column called .DP next to each column which will have 50 across the column and get the result. How can I do this kind of conditional expansion of matrix in R?

mymat <- structure(c("0/1", "1/1", "0/0", "0/0"), .Dim = c(2L, 2L), .Dimnames = list(
c("chr1:1163804", "chr1:1888193"
), c("00860.GT", "00861.GT")))

result:

           00860.GT 00860.AD 00860.DP 00861.GT 00861.AD 00861.DP
chr1:1163804 0/1      25/25       50      0/0     50,0     50
chr1:1888193 1/1      0/50        50      0/0     50,0     50

Upvotes: 0

Views: 477

Answers (2)

aichao
aichao

Reputation: 7435

There may be a better way, but one way to do this using dplyr is:

library(dplyr)

set.AD <- function(x) {                                                   ## 1.
  if (x=="0/0") {
    return("50/0")
  } else if (x=="0/1") {
    return("25/25")
  } else {
    return("0/50")
  }
}
mymat <- data.frame(ID=seq_len(nrow(mymat)),mymat)                        ## 2.
rnames <- rownames(mymat)
out = mymat %>% group_by(ID)                                              ## 3.
            %>% mutate(`X00860.AD`=set.AD(`X00860.GT`), `X00860.DP`=50,
                       `X00861.AD`=set.AD(`X00861.GT`), `X00861.DP`=50)
out <- data.frame(out[,-1])                                               ## 4.
rownames(out) <- rnames

Notes:

  1. Define a function that sets the AD column from the GT column based on your logic.
  2. Convert your data to a data frame, adding a unique identifier column so that we can apply the function to each row using group_by. Also preserve the row names.
  3. Use mutate to create the AD and DP columns for both the X00860.GT and X00861.GT columns. Note that the conversion to data frame prepends the column names with X because R does not like column names that start with a number. See this SO answer for an explanation.

At this point what is returned is a tibble. Therefore,

  1. Remove the ID column, convert to a data frame, and add back the row names.

The result on your data is:

print(out)
##             X00860.GT X00861.GT X00860.AD X00860.DP X00861.AD X00861.DP
##chr1:1163804       0/1       0/0     25/25        50      50/0        50
##chr1:1888193       1/1       0/0      0/50        50      50/0        50

To reorder the columns to match your output, you can simply:

out <- out[,c(1,3,4,2,5,6)]
##             X00860.GT X00860.AD X00860.DP X00861.GT X00861.AD X00861.DP
##chr1:1163804       0/1     25/25        50       0/0      50/0        50
##chr1:1888193       1/1      0/50        50       0/0      50/0        50

Obviously, this approach can only handle your two columns, but can handle any number of rows.


Edited to handle any number of columns (samples)

Notes are given as comments

# keep column and row names of original mymat to use later
cnames <- colnames(mymat)
rnames <- rownames(mymat)
# since DP columns are always 50, we just create a data frame filled with 50
# to bind to the result as additional columns
dp <- data.frame(matrix(rep(50,ncol(mymat)*nrow(mymat)), nrow=nrow(mymat), ncol=ncol(mymat)))
# set the column name to that of mymat
colnames(dp) <- cnames
# convert to data frame and augment with ID as before
mymat <- data.frame(ID=seq_len(nrow(mymat)),mymat)
# the difference here is that we use mutate_each to apply set.AD to each
# (and all) column of the input. This is done in-place. We then bind the 
# original mymat and dp as columns to this result
out <- mymat %>% group_by(ID) 
             %>% mutate_each(funs(set.AD)) 
             %>% ungroup() %>% select(-ID) 
             %>% bind_cols(mymat[,-1],.) %>% bind_cols(dp)
# At this point, we have the original mymat columns followed by the 
# AD columns followed by the DP columns. The following uses a matrix 
# transpose trick to resort the columns to what you want
col.order <- as.vector(t(matrix(seq_len(ncol(out)), nrow=ncol(mymat)-1, ncol=3)))
out <- data.frame(out[,col.order])
# finally, use gsub to change the column names for the AD and DP columns,
# get rid of the 'X' in the column names, and add back the row names
colnames(out) <- gsub("X", "", gsub("GT.1", "AD", gsub("GT.2", "DP", colnames(out))))
rownames(out) <- rnames
print(out)
##             00860.GT 00860.AD 00860.DP 00861.GT 00861.AD 00861.DP
##chr1:1163804      0/1    25/25       50      0/0     50/0       50
##chr1:1888193      1/1     0/50       50      0/0     50/0       50

Hope this helps.

Upvotes: 1

jav
jav

Reputation: 1495

Here's a data.table solution, with each line commented. It is written to handle any number of columns in your mymat object. I will explain briefly:

1) First, we convert to a data.table format where we can handle any number of columns, assuming it will be in a similar format.

2) We find all of the ".GT" columns and extract the number before the ".GT".

3) We create ".DP" columns for each ".GT" column found.

4) We develop a "GT" to "AD" mapping by creating a vector of the "to" part of the mapping. The "from" part is stored as names in the vector.

5) Use the .SDcols feature in the data.table to apply the "GT" to "AD" mapping, and create the "AD" columns.

# Your matrix
mymat <- structure(c("0/1", "1/1", "0/0", "0/0"), .Dim = c(2L, 2L), 
                   .Dimnames = list(c("chr1:1163804", "chr1:1888193"), 
                    c("00860.GT", "00861.GT")))

# Using a data table approach
library(data.table)

# Casting to data table - row.names will be converted to a column called 'rn'.
mymat = as.data.table(mymat, keep.rownames = T)

# Find "GT" columns
GTcols = grep("GT", colnames(mymat))

# Get number before ".GT"
selectedCols = gsub(".GT", "", colnames(mymat)[GTcols])

selectedCols
[1] "00860" "00861"

# Create ".DP" columns
mymat[, paste0(selectedCols, ".DP") := 50, with = F]

mymat
             rn 00860.GT 00861.GT 00860.DP 00861.DP
1: chr1:1163804      0/1      0/0       50       50
2: chr1:1888193      1/1      0/0       50       50

# Create "GT" to "AD" mapping
GTToADMapping = c("50,0", "25/25", "0/50")
names(GTToADMapping) = c("0/0", "0/1", "1/1")

GTToADMapping
0/0     0/1     1/1 
"50,0" "25/25"  "0/50" 

# This function will return the "AD" mapping given the values of "GT"
mapGTToAD <- function(x){
  return (GTToADMapping[x])
}

# Here, we create the AD columns using the GT mapping
mymat[, (paste0(selectedCols, ".AD")) := lapply(.SD, mapGTToAD), with = F,
        .SDcols = colnames(mymat)[GTcols]]

             rn 00860.GT 00861.GT 00860.DP 00861.DP 00860.AD 00861.AD
1: chr1:1163804      0/1      0/0       50       50    25/25     50,0
2: chr1:1888193      1/1      0/0       50       50     0/50     50,0

# We can sort the data now as you have it
colOrder = as.vector(rbind(paste0(selectedCols, ".GT"), 
                     paste0(selectedCols, ".AD"), 
                     paste0(selectedCols, ".DP")))
mymat = mymat[, c("rn", colOrder), with = F]

mymat
             rn 00860.GT 00860.AD 00860.DP 00861.GT 00861.AD 00861.DP
1: chr1:1163804      0/1    25/25       50      0/0     50,0       50
2: chr1:1888193      1/1     0/50       50      0/0     50,0       50

# Put it back in the format you had
mymat2 = as.matrix(mymat[,-1, with = F])
rownames(mymat2) = mymat$rn

mymat2
             00860.GT 00860.AD 00860.DP 00861.GT 00861.AD 00861.DP
chr1:1163804 "0/1"    "25/25"  "50"     "0/0"    "50,0"   "50"    
chr1:1888193 "1/1"    "0/50"   "50"     "0/0"    "50,0"   "50"    

Upvotes: 1

Related Questions