Reputation: 135
I have a (large) data frame which has a structure relatively similar to this:
id1 id2 symbol1 symbol2 scoreA scoreB scoreC
4790 1120 ABC LLL 1 0 1
2300 4790 NNN ABC 0 0 1
1120 4790 LLL ABC 0 1 1
1120 3120 LLL CCC 0 0 0
I am trying to filter the data frame so that I can every row in which symbol1
and symbol2
match two different strings, this is also being done repeatedly and dynamically so I am searching for the strings as variables.
So in the above example, if I were looking for every instance where the two symbols are ABC
and LLL
, I would output a result like:
id1 id2 symbol1 symbol2 scoreA scoreB scoreC
4790 1120 ABC LLL 1 0 1
1120 4790 LLL ABC 0 1 1
So my issue is that I want to try and search for every row where one of the columns is equal to either of the values AND the other column is equal to the other of the two.
My solution is to do something like the following:
c1_step1 = scores_file[scores_file$symbol1 == in_gene,]
c2_step1 = scores_file[scores_file$symbol1 == end_gene,]
c1_step2 = c1_step1[c1_step1$symbol2 == end_gene,]
c2_step2 = c2_step1[c2_step1$symbol2 == in_gene,]
out_file = rbind(c1_step2, c2_step2)
However, this just feels fairly bulky and inelegant and I am wondering if there is potentially a nicer (and also more easily readable) method of doing something like this? Maybe something using dplyr that I'm not aware of?
Upvotes: 1
Views: 73
Reputation: 101064
With base R
, maybe the following code can reach the objective:
res <- df[which(with(df, match(symbol1,symbol2) & match(symbol2,symbol1))),]
which gives
> res
id1 id2 symbol1 symbol2 scoreA scoreB scoreC
1 4790 1120 ABC LLL 1 0 1
3 1120 4790 LLL ABC 0 1 1
DATA
df <- structure(list(id1 = c(4790L, 2300L, 1120L, 1120L), id2 = c(1120L,
4790L, 4790L, 3120L), symbol1 = c("ABC", "NNN", "LLL", "LLL"),
symbol2 = c("LLL", "ABC", "ABC", "CCC"), scoreA = c(1L, 0L,
0L, 0L), scoreB = c(0L, 0L, 1L, 0L), scoreC = c(1L, 1L, 1L,
0L)), class = "data.frame", row.names = c(NA, -4L))
Upvotes: 1
Reputation: 193507
The best approach would be to use %in%
instead of ==
, like this:
SYM <- c("ABC", "LLL")
library(data.table)
setDT(mydf)[symbol1 %in% SYM & symbol2 %in% SYM]
## id1 id2 symbol1 symbol2 scoreA scoreB scoreC
## 1: 4790 1120 ABC LLL 1 0 1
## 2: 1120 4790 LLL ABC 0 1 1
Or, with "dplyr", you can try either of the following:
library(dplyr)
SYM <- c("ABC", "LLL")
# Option 1
mydf %>%
filter(symbol1 %in% SYM, symbol2 %in% SYM)
# Option 2
mydf %>%
filter_at(vars(symbol1, symbol2), all_vars(. %in% SYM))
Here's the starting data:
mydf <- structure(list(id1 = c(4790L, 2300L, 1120L, 1120L), id2 = c(1120L,
4790L, 4790L, 3120L), symbol1 = c("ABC", "NNN", "LLL", "LLL"),
symbol2 = c("LLL", "ABC", "ABC", "CCC"), scoreA = c(1L, 0L,
0L, 0L), scoreB = c(0L, 0L, 1L, 0L), scoreC = c(1L, 1L, 1L,
0L)), index = structure(integer(0), "`__symbol1`" = c(1L,
3L, 4L, 2L), "`__symbol2`" = c(2L, 3L, 4L, 1L), "`__symbol2__symbol1`" = c(3L,
2L, 4L, 1L)), row.names = c(NA, 4L), class = "data.frame")
As I hinted at in the comments "inelegance" can be pretty subjective. I decided to try your approach out for efficiency, along with my suggestions here, and my original suggestion of using paste
. I also added an option with interaction
which can sometimes be faster than paste
.
library(data.table)
library(dplyr)
x <- rbindlist(replicate(250, mydf, FALSE)) ## 1000 rows
y <- rbindlist(replicate(1000, x, FALSE)) ## 1 million rows
# OP's approach
op_fun <- function(data = x) {
c1_step1 = data[symbol1 == "ABC",]
c2_step1 = data[symbol1 == "LLL",]
c1_step2 = c1_step1[symbol2 == "LLL",]
c2_step2 = c2_step1[symbol2 == "ABC",]
rbind(c1_step2, c2_step2)
}
# data.table
am_fun <- function(data = x, symbs = c("ABC", "LLL")) {
data[symbol1 %in% symbs & symbol2 %in% symbs]
}
# dplyr
am_dplyr_1 <- function(data = x, symbs = c("ABC", "LLL")) {
data %>% filter(symbol1 %in% symbs, symbol2 %in% symbs)
}
am_dplyr_2 <- function(data = x, symbs = c("ABC", "LLL")) {
data %>% filter_at(vars(symbol1, symbol2), all_vars(. %in% symbs))
}
# base R
paste_fun <- function(data = x) {
as.data.table(data)[paste(symbol1, symbol2) %in% c("ABC LLL", "LLL ABC")]
}
interaction_fun <- function(data = x) {
as.data.table(data)[interaction(symbol1, symbol2) %in% c("ABC.LLL", "LLL.ABC")]
}
bench::mark(op_fun(x), am_fun(x), am_dplyr_1(x), am_dplyr_2(x), paste_fun(x), interaction_fun(x),
check = FALSE, time_unit = "ms")
# # A tibble: 6 x 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
# <bch:expr> <dbl> <dbl> <dbl> <bch:byt> <dbl> <int> <dbl> <dbl> <list> <list> <list> <list>
# 1 op_fun(x) 4.16 4.28 230. 523.1KB 6.57 105 3 456. <df[,7] [500 × … <df[,3] [112 ×… <bch:t… <tibble [108 × …
# 2 am_fun(x) 1.29 1.36 701. 150.9KB 4.21 333 2 475. <df[,7] [500 × … <df[,3] [31 × … <bch:t… <tibble [335 × …
# 3 am_dplyr_1(x) 0.602 0.627 1565. 62.7KB 8.73 717 4 458. <df[,7] [500 × … <df[,3] [22 × … <bch:t… <tibble [721 × …
# 4 am_dplyr_2(x) 1.45 1.52 645. 66.6KB 6.56 295 3 457. <df[,7] [500 × … <df[,3] [28 × … <bch:t… <tibble [298 × …
# 5 paste_fun(x) 0.403 0.414 2374. 155.6KB 4.20 1130 2 476. <df[,7] [500 × … <df[,3] [33 × … <bch:t… <tibble [1,132 …
# 6 interaction_fun(x) 0.483 0.496 1960. 219.1KB 6.38 922 3 470. <df[,7] [500 × … <df[,3] [46 × … <bch:t… <tibble [925 × …
bench::mark(op_fun(y), am_fun(y), am_dplyr_1(y), am_dplyr_2(y), paste_fun(y), interaction_fun(y),
check = FALSE, time_unit = "ms")
# # A tibble: 6 x 13
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
# <bch:expr> <dbl> <dbl> <dbl> <bch:byt> <dbl> <int> <dbl> <dbl> <list> <list> <list> <list>
# 1 op_fun(y) 67.2 71.9 8.65 98.6MB 10.4 5 6 578. <df[,7] [500,000 … <df[,3] [112 ×… <bch:tm> <tibble [5 ×…
# 2 am_fun(y) 19.8 25.0 39.7 36.4MB 19.8 20 10 504. <df[,7] [500,000 … <df[,3] [31 × … <bch:tm> <tibble [20 …
# 3 am_dplyr_1(y) 33.4 41.4 19.3 59.1MB 13.5 10 7 518. <df[,7] [500,000 … <df[,3] [22 × … <bch:tm> <tibble [10 …
# 4 am_dplyr_2(y) 34.5 43.8 23.3 59.1MB 19.5 12 10 514. <df[,7] [500,000 … <df[,3] [28 × … <bch:tm> <tibble [12 …
# 5 paste_fun(y) 181. 196. 4.38 103MB 5.84 3 4 685. <df[,7] [500,000 … <df[,3] [34 × … <bch:tm> <tibble [3 ×…
# 6 interaction_fun(y) 108. 168. 5.88 164.8MB 10.3 4 7 681. <df[,7] [500,000 … <df[,3] [53 × … <bch:tm> <tibble [4 ×…
As you can see, with smaller data (1000 rows), paste
and interaction
are fast. interaction
seems to scale better than paste
, but it still doesn't compare with your approach or the approach using %in%
that I've suggested here. The data.table
and dplyr
approaches scale the best and remain very readable.
NOTE: I haven't tested the other answers as I don't think they are quite correct.
Upvotes: 2
Reputation: 4367
This might need a little improvement if the data set is large and the function becomes slow. This will generate all combinations of symbols and check if there are at least two rows with matching symbols. All matched symbols will be returned in separate data.frames
in a list
matchedSymbols <- function(data) {
allSymbols <- unique(c(data$symbol1, data$symbol2)) # gets all unique symbols from data set
allCombinations <- combn(allSymbols, 2) # gets all combinations of unique symbols
symbolFlags <- apply(allCombinations, 2, function(x) data$symbol1 == x[1] & data$symbol2 == x[2] | data$symbol1 == x[2] & data$symbol2 == x[1]) # checks which records contain symbol sets
pairFlags <- symbolFlags[, colSums(symbolFlags) > 1] # checks if two rows contain symbols; one means only single match
lapply(seq_len(ncol(pairFlags)), function(x) data[pairFlags[ ,x], ]) # generate list of all matched pairs
}
Pass the entire data.frame
to the function
matchedSymbols(scores_file)
Added dummy data to insert more pairs for testing
df1 <- read.table(
text="id1 id2 symbol1 symbol2 scoreA scoreB scoreC
4790 1120 ABC LLL 1 0 1
2300 4790 NNN ABC 0 0 1
1120 4790 LLL ABC 0 1 1
1120 3120 LLL CCC 0 0 0
1120 3120 XYZ ZYX 0 0 0
1120 3120 ZYX XYZ 0 0 0",
header = TRUE,
stringsAsFactor = FALSE)
> matchedSymbols(df1)
[[1]]
id1 id2 symbol1 symbol2 scoreA scoreB scoreC
1 4790 1120 ABC LLL 1 0 1
3 1120 4790 LLL ABC 0 1 1
[[2]]
id1 id2 symbol1 symbol2 scoreA scoreB scoreC
5 1120 3120 XYZ ZYX 0 0 0
6 1120 3120 ZYX XYZ 0 0 0
Upvotes: 1