Reputation: 129
I have a large data frame with "positive" (1) or "negative" (0) data points.
Example data:
my_data <- data.frame(cell = 1:4, marker_a = c(1, 0, 0, 0),
marker_b = c(0,1,1,1), marker_c = c(0,1,1,0), marker_d = c(0,1,0,1))
cell marker_a marker_b marker_c marker_d
1 1 1 0 0 0
2 2 0 1 1 1
3 3 0 1 1 0
4 4 0 1 0 1
...
I have a second data.frame with all the possible combinations of positive and negative markers that any my_data$cell can have:
combinations_df <- expand.grid(
marker_a = c(0, 1),
marker_b = c(0, 1),
marker_c = c(0, 1),
marker_d = c(0, 1)
)
marker_a marker_b marker_c marker_d
1 0 0 0 0
2 1 0 0 0
3 0 1 0 0
4 1 1 0 0
5 0 0 1 0
6 1 0 1 0
7 0 1 1 0
8 1 1 1 0
9 0 0 0 1
10 1 0 0 1
11 0 1 0 1
12 1 1 0 1
13 0 0 1 1
14 1 0 1 1
15 0 1 1 1
16 1 1 1 1
How can I get a data.frame in which each row (combination) of combinations_df is matched against every row of my_data, returning the number of matching cells for each combination?
Example of expected output:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1 14969 15223 15300 14779 14844 16049 15374 15648 15045 15517 15116 15405 14990 15347 14432 15569
Upvotes: 1
Views: 70
Reputation: 24480
Your combinations are written in "binary", so there is no need for a join, just a little math. Try this:
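# Read each row as a binary number (marker_a is the lowest bit); adding 1 gives its row in combinations_df
setNames(tabulate(as.matrix(my_data[, 2:5]) %*% 2^(0:3) + 1, 16), 1:16)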
# 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
# 0 1 0 0 0 0 1 0 0 0 1 0 0 0 1 0
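To see why this works, it may help to split the one-liner into steps. This is only a sketch on the example data from the question; the names weights, idx and counts are mine, not part of the answer above. It relies on expand.grid() varying the first column fastest, which is what makes marker_a the lowest bit.

```r
# Example data from the question
my_data <- data.frame(cell = 1:4, marker_a = c(1, 0, 0, 0),
                      marker_b = c(0, 1, 1, 1), marker_c = c(0, 1, 1, 0),
                      marker_d = c(0, 1, 0, 1))
combinations_df <- expand.grid(marker_a = c(0, 1), marker_b = c(0, 1),
                               marker_c = c(0, 1), marker_d = c(0, 1))

# One weight per marker column: 1, 2, 4, 8 (illustrative name)
weights <- 2^(0:3)

# Binary value of each cell's marker pattern, shifted by 1 because
# R indexes from 1 and the all-zero pattern is row 1 of combinations_df
idx <- as.vector(as.matrix(my_data[, 2:5]) %*% weights) + 1
idx
# 2 15 7 11

# Applying the same formula to combinations_df itself returns 1:16,
# which confirms that idx really is a row number in combinations_df
stopifnot(all(as.vector(as.matrix(combinations_df) %*% weights) + 1 == 1:16))

# Count how many cells fall on each of the 16 row numbers
counts <- setNames(tabulate(idx, nbins = nrow(combinations_df)),
                   seq_len(nrow(combinations_df)))
```

If combinations_df were built in a different column or row order, the weights would have to change accordingly.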
Upvotes: 1
Reputation: 66819
I'm guessing the data.table way is fairly efficient:
library(data.table)
setDT(my_data)
my_data[ combinations_df, on = names(combinations_df), .N, by = .EACHI ]
marker_a marker_b marker_c marker_d N
1: 0 0 0 0 0
2: 1 0 0 0 1
3: 0 1 0 0 0
4: 1 1 0 0 0
5: 0 0 1 0 0
6: 1 0 1 0 0
7: 0 1 1 0 1
8: 1 1 1 0 0
9: 0 0 0 1 0
10: 1 0 0 1 0
11: 0 1 0 1 1
12: 1 1 0 1 0
13: 0 0 1 1 0
14: 1 0 1 1 0
15: 0 1 1 1 1
16: 1 1 1 1 0
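If you want the layout from the question (one count per combination, labelled 1 to 16), a possible sketch is to pull out the N column and name it by row position. The names dt, res and counts are mine, and I use as.data.table() on a copy here instead of setDT() so that my_data itself stays a plain data.frame.

```r
library(data.table)

# Example data from the question
my_data <- data.frame(cell = 1:4, marker_a = c(1, 0, 0, 0),
                      marker_b = c(0, 1, 1, 1), marker_c = c(0, 1, 1, 0),
                      marker_d = c(0, 1, 0, 1))
combinations_df <- expand.grid(marker_a = c(0, 1), marker_b = c(0, 1),
                               marker_c = c(0, 1), marker_d = c(0, 1))

# Work on a data.table copy (illustrative name) rather than converting in place
dt <- as.data.table(my_data)

# Same join as above: one result row per row of combinations_df, in the same order
res <- dt[combinations_df, on = names(combinations_df), .N, by = .EACHI]

# Named vector of counts, labelled by the row number of each combination
counts <- setNames(res$N, seq_len(nrow(res)))
counts
```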
If you only care about combinations that show up in the data, "chain" a filtering command:
my_data[ combinations_df, on = names(combinations_df), .N, by = .EACHI ][ N > 0 ]
marker_a marker_b marker_c marker_d N
1: 1 0 0 0 1
2: 0 1 1 0 1
3: 0 1 0 1 1
4: 0 1 1 1 1
Alternatively, in this case you don't even need combinations_df:
my_data[, .N, by = marker_a:marker_d ]
marker_a marker_b marker_c marker_d N
1: 1 0 0 0 1
2: 0 1 1 1 1
3: 0 1 1 0 1
4: 0 1 0 1 1
Upvotes: 1
Reputation: 886938
Perhaps you need:
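# Paste each row into a key such as "0110"; my_data[-1] drops the cell column (my_data must be a plain data.frame)
setNames(sapply(do.call(paste0, combinations_df),
                function(x) sum(do.call(paste0, my_data[-1]) == x)),
         1:nrow(combinations_df))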
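The call above rebuilds the pasted key for my_data once per combination. On a large data frame it may be cheaper to build both sets of keys once and let match() and tabulate() do the counting. This is a sketch of that variant rather than a tested benchmark; cell_key, combo_key and counts are names I made up for it.

```r
# Example data from the question
my_data <- data.frame(cell = 1:4, marker_a = c(1, 0, 0, 0),
                      marker_b = c(0, 1, 1, 1), marker_c = c(0, 1, 1, 0),
                      marker_d = c(0, 1, 0, 1))
combinations_df <- expand.grid(marker_a = c(0, 1), marker_b = c(0, 1),
                               marker_c = c(0, 1), marker_d = c(0, 1))

# One string key per row, e.g. "0110" (illustrative names)
cell_key  <- do.call(paste0, my_data[-1])
combo_key <- do.call(paste0, combinations_df)

# match() gives, for each cell, the row of combinations_df with the same key;
# tabulate() then counts how often each row number occurs
counts <- setNames(tabulate(match(cell_key, combo_key), nbins = length(combo_key)),
                   seq_along(combo_key))
counts
```

As with the original, this assumes my_data is a plain data.frame, so that my_data[-1] drops the first column and not the first row.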
Upvotes: 0