dan
dan

Reputation: 6314

Efficiently collapse a matrix

I have a matrix of this format:

set.seed(1)
mat <- matrix(round(runif(25,0,1)),nrow=5,ncol=5)
colnames(mat) <- c("a1::C","a1::A","a1::B","b1::D","b1::A")

     a1::C a1::A a1::B b1::D b1::A
[1,]     0     1     0     0     1
[2,]     0     1     0     1     0
[3,]     1     1     1     1     1
[4,]     1     1     0     0     0
[5,]     0     0     1     1     0

In words, every column is a subject and a feature (indicated by the column name where they are separated by ::). In every row a value of 1 indicates that subject has that feature and 0 if it doesn't. It is possible that a certain subject will have 0's in all its columns for a specific row.

I want to construct a new matrix where the columns will be the subjects (i.e., one column per subject) and in the rows the features this subject has will appear alphabetically sorted and coma separated. In case a subject doesn't have any feature (i.e., a certain row all with 0's for that subject) a value of "W" should be used (none of the features have a value of "W").

Here's what the new matrix based on mat will look like:

cnames = unique(sapply(colnames(mat), function(x) strsplit(x,split="::")[[1]][1]))
new_mat <- matrix(c("A","A","A,B,C","A,C","B",
                    "A","D","A,D","W","D"),
                  nrow=nrow(mat),ncol=length(cnames))
colnames(new_mat) = cnames

     a1      b1   
[1,] "A"     "A"  
[2,] "A"     "D"  
[3,] "A,B,C" "A,D"
[4,] "A,C"   "W"  
[5,] "B"     "D"

Any idea what would be an efficient and elegant way to achieve this?

Upvotes: 4

Views: 1223

Answers (2)

Zheyuan Li
Zheyuan Li

Reputation: 73315

Step 1: Matrix column pivoting

mat <- mat[, order(colnames(mat))]

#      a1::A a1::B a1::C b1::A b1::D
# [1,]     1     0     0     1     0
# [2,]     1     0     0     0     1
# [3,]     1     1     1     1     1
# [4,]     1     0     1     0     0
# [5,]     0     1     0     0     1

Step 2.1: Column name decomposition

## decompose levels, get main levels (before ::) and sub levels (post ::)
decom <- strsplit(colnames(mat), "::")

main_levels <- sapply(decom, "[", 1)
# [1] "a1" "a1" "a1" "b1" "b1"

sub_levels <- sapply(decom, "[", 2)
# [1] "A" "B" "C" "A" "D"

Step 2.2: Grouping index generation

## generating grouping index
main_index <- paste(rep(main_levels, each = nrow(mat)), rep(1:nrow(mat), times = ncol(mat)), sep = "#")
sub_index <- rep(sub_levels, each = nrow(mat))
sub_index[!as.logical(mat)] <- ""  ## 0 values in mat implies ""

## in unclear of what "main_index" and "sub_index" are, check:

## matrix(main_index, nrow(mat))
#      [,1]   [,2]   [,3]   [,4]   [,5]  
# [1,] "a1#1" "a1#1" "a1#1" "b1#1" "b1#1"
# [2,] "a1#2" "a1#2" "a1#2" "b1#2" "b1#2"
# [3,] "a1#3" "a1#3" "a1#3" "b1#3" "b1#3"
# [4,] "a1#4" "a1#4" "a1#4" "b1#4" "b1#4"
# [5,] "a1#5" "a1#5" "a1#5" "b1#5" "b1#5"

## matrix(sub_index, nrow(mat))
#      [,1] [,2] [,3] [,4] [,5]
# [1,] "A"  ""   ""   "A"  ""  
# [2,] "A"  ""   ""   ""   "D" 
# [3,] "A"  "B"  "C"  "A"  "D" 
# [4,] "A"  ""   "C"  ""   ""  
# [5,] ""   "B"  ""   ""   "D" 

Step 2.3: Conditional collapsed pasting

## collapsed paste of "sub_index" conditional on "main_index"
x <- unname(tapply(sub_index, main_index, paste0, collapse = ""))
x[x == ""] <- "W"
# [1] "A"   "A"   "ABC" "AC"  "B"   "A"   "D"   "AD"  "W"   "D" 

Step 3: Post-processing

I am not very happy with this, but did not find an alternative.

x <- sapply(strsplit(x, ""), paste0, collapse = ",")
#  [1] "A"   "A"   "A,B,C"  "A,C"   "B"   "A"   "D"   "A,D"  "W"  "D"

Step 4: Matrix

x <- matrix(x, nrow = nrow(mat))
colnames(x) <- unique(main_levels)

#      a1      b1   
# [1,] "A"     "A"  
# [2,] "A"     "D"  
# [3,] "A,B,C" "A,D"
# [4,] "A,C"   "W"  
# [5,] "B"     "D" 

Efficiency consideration

The method itself is rather efficient using vectorization, and does not require manual input of grouping information. For example, you can use the same code, when you have even hundreds of main groups (before ::) and hundreds of sub groups (post ::).

The only consideration, is to reduce unnecessary memory copies. In this regard, we should use anonymous function whenever we can, without explicit matrix assignment like what is demonstrated above. This would be good (already tested):

 decom <- strsplit(sort(colnames(mat)), "::")
 main_levels <- sapply(decom, "[", 1)

 sub_index <- rep(sapply(decom, "[", 2), each = nrow(mat))
 sub_index[!as.logical(mat[, order(colnames(mat))])] <- ""

 x <- unname(tapply(sub_index,
                    paste(rep(main_levels, each = nrow(mat)),
                          rep(1:nrow(mat), times = ncol(mat)),
                          sep = "#"),
                    paste0, collapse = ""))

 x <- matrix(sapply(strsplit(x, ""), paste0, collapse = ","),
             nrow = nrow(mat))

 colnames(x) <- unique(main_levels)

Upvotes: 4

akuiper
akuiper

Reputation: 214957

Here is a starting point. Depending on how many variables you have, this might get cumbersome, though.

library(data.table)
dt = data.table(id = seq_len(nrow(mat)), mat)
longDt <- melt(dt, id.vars = "id", measure = patterns("^a1::", "^b1::"))

longDt[, .(a1 = list(sort(c("C", "A", "B")[as.logical(value1)])), 
           b1 = list(sort(c("D", "A")[as.logical(value2)]))), .(id)]
   id    a1  b1
1:  1     A   A
2:  2     A   D
3:  3 A,B,C A,D
4:  4   A,C    
5:  5     B   D

Upvotes: 2

Related Questions