Nicholas
Nicholas

Reputation: 93

How to combine rows / add their values according to information in other (data.)table?

I have information stored in two data frames / data.tables and I would like to use the information stored in the second one (my lookup table) to add / collapse rows in my first one.

My first data.table looks something like this:

ID    Sample01    Sample02    Sample03
1        1            2           3
2        0            1           2
3        2            0           1 
4        5            2           1

My second data.table (distance matrix) looks something like this:

0    12    4    1
12    0    3    5
4     3    0    6
1     5    6    0

I would like to add the values of rows in my first data.table, if the attribute in the second, which displays distances, is smaller than 2.

It should then look like this (rows 1 and 4 are added, as the distance is <2):

ID    Sample01    Sample02    Sample03
1        6            4           4
2        0            1           2
3        2            0           1 

As the data is extremely large, I would prefer to work with data.table, but I can also work with other packages, any help is appreciated!

Edit:

This would be a real-world example of table 1 (dput):

structure(list(v_j_cdr3 = c("TCRBV02-01-TCRBJ01-01-SLGGmhcLFF", 
"TCRBV02-01-TCRBJ01-01-SLGhLmhcLFF", "TCRBV02-01-TCRBJ01-01-SLLLGchcLFF", 
"TCRBV02-01-TCRBJ01-01-SLLmGmhcLFF"), HIP08805 = c(6.4439e-05, 
0, 0, 8.69177e-05), HIP17837 = c(6.01681e-05, 0, 0, 0), Keck0006_MC1 = c(2.48385e-05, 
4.96771e-05, 0, 0)), row.names = c(NA, -4L), class = c("data.table", 
"data.frame"), .internal.selfref = <pointer: 0x7ffb190066e0>)

This would be a real-world example of table 2 (dput):

structure(list(`1` = c(0, 2, 3, 2), `2` = c(2, 0, 4, 3), `3` = c(3, 
4, 0, 2), `4` = c(2, 3, 2, 0)), row.names = c(NA, -4L), class = c("data.table", 
"data.frame"), .internal.selfref = <pointer: 0x7ffb190066e0>)

Edit2 : solution (without deletion of rows as the comments made me realize that I should not do that).

diag(matrix) <- 99
pos <- which(matrix<2,arr.ind = T,useNames = F)
datasubset_raw <- datasubset
for(j in 1:nrow(pos)){datasubset[pos[j,1],] <- datasubset[pos[j,1],]+datasubset_raw[pos[j,2],]}

Upvotes: 1

Views: 93

Answers (1)

Obviously you should do a loop for all situations on your dataset, but this idea might help you

df1 <- data.frame("ID"=c(1,2,3,4),"Sample01"=c(1,0,2,5),
              "Sample02"=c(2,1,0,2),"Sample03"=c(3,2,0,1))
# df1
#   ID Sample01 Sample02 Sample03
# 1  1        1        2        3
# 2  2        0        1        2
# 3  3        2        0        0
# 4  4        5        2        1
dist_matrix <- matrix(c(0,12,4,1,
                       12,0,3,5,
                       4,3,0,6,
                       1,5,6,0),nrow = 4,ncol = 4)
#> dist_matrix
#     [,1] [,2] [,3] [,4]
#[1,]    0   12    4    1
#[2,]   12    0    3    5
#[3,]    4    3    0    6
#[4,]    1    5    6    0

Lets assume you are not interested in diagonals for the condition, so it should be > 2

diag(dist_matrix) <- 999

Position with distances < 2 (1 and 4)

pos <- which(dist_matrix<=2,arr.ind = T,useNames = F)
# pos
#      [,1] [,2]
# [1,]    4    1
# [2,]    1    4
for(i in 1:nrow(pos)){pos[i,] <- sort(pos[i,])} # sort
# pos
#     [,1] [,2]
#[1,]    1    4
#[2,]    1    4
pos <- unique(pos) # your final vector
#pos
#     [,1] [,2]
#[1,]    1    4

Sum then up

df1[pos[1],2:4] <- df1[pos[1],2:4]+df1[pos[2],2:4]
#df1
#  ID Sample01 Sample02 Sample03
#1  1        6        4        4
#2  2        0        1        2
#3  3        2        0        0
#4  4        5        2        1

Delete 4th position

df1 <- df1[-pos[2],]
# df1
#   ID Sample01 Sample02 Sample03
# 1  1        6        4        4
# 2  2        0        1        2
# 3  3        2        0        0

Upvotes: 1

Related Questions