Sabor117

Reputation: 135

Efficient way to search for two strings in two columns in R

I have a (large) data frame which has a structure relatively similar to this:

 id1  id2 symbol1 symbol2 scoreA scoreB scoreC
4790 1120     ABC     LLL      1      0      1
2300 4790     NNN     ABC      0      0      1
1120 4790     LLL     ABC      0      1      1
1120 3120     LLL     CCC      0      0      0

I am trying to filter the data frame so that I keep every row in which symbol1 and symbol2 match two different strings. This is also being done repeatedly and dynamically, so I am searching for the strings as variables.

So in the above example, if I were looking for every instance where the two symbols are ABC and LLL, the output should look like:

 id1  id2 symbol1 symbol2 scoreA scoreB scoreC
4790 1120     ABC     LLL      1      0      1
1120 4790     LLL     ABC      0      1      1

So my issue is that I want to find every row where one of the symbol columns equals either of the two values AND the other column equals the remaining value.

My solution is to do something like the following:

# rows where symbol1 matches the first gene
c1_step1 = scores_file[scores_file$symbol1 == in_gene,]
# rows where symbol1 matches the second gene
c2_step1 = scores_file[scores_file$symbol1 == end_gene,]

# of those, keep only the rows where symbol2 matches the other gene
c1_step2 = c1_step1[c1_step1$symbol2 == end_gene,]
c2_step2 = c2_step1[c2_step1$symbol2 == in_gene,]

# combine both orientations of the pair
out_file = rbind(c1_step2, c2_step2)

However, this just feels fairly bulky and inelegant, and I am wondering whether there is a nicer (and more easily readable) way of doing something like this? Maybe something using dplyr that I'm not aware of?

Upvotes: 1

Views: 73

Answers (3)

ThomasIsCoding

Reputation: 101064

With base R, maybe the following code can achieve the objective:

res <- df[which(with(df, match(symbol1, symbol2) & match(symbol2, symbol1))), ]

which gives

> res
   id1  id2 symbol1 symbol2 scoreA scoreB scoreC
1 4790 1120     ABC     LLL      1      0      1
3 1120 4790     LLL     ABC      0      1      1
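
For reference, match() returns the position of the first occurrence of each element (or NA when there is none), and which() drops the NAs, so only rows where both lookups succeed are kept. On the DATA below, the two intermediate vectors look like this:

> with(df, match(symbol1, symbol2))
[1]  2 NA  1  1
> with(df, match(symbol2, symbol1))
[1]  3  1  1 NA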

DATA

df <- structure(list(id1 = c(4790L, 2300L, 1120L, 1120L), id2 = c(1120L, 
4790L, 4790L, 3120L), symbol1 = c("ABC", "NNN", "LLL", "LLL"), 
    symbol2 = c("LLL", "ABC", "ABC", "CCC"), scoreA = c(1L, 0L, 
    0L, 0L), scoreB = c(0L, 0L, 1L, 0L), scoreC = c(1L, 1L, 1L, 
    0L)), class = "data.frame", row.names = c(NA, -4L))

Upvotes: 1

A5C1D2H2I1M1N2O1R2T1

Reputation: 193507

The best approach would be to use %in% instead of ==, like this:

SYM <- c("ABC", "LLL")

library(data.table)
setDT(mydf)[symbol1 %in% SYM & symbol2 %in% SYM]
##     id1  id2 symbol1 symbol2 scoreA scoreB scoreC
## 1: 4790 1120     ABC     LLL      1      0      1
## 2: 1120 4790     LLL     ABC      0      1      1

Or, with "dplyr", you can try either of the following:

library(dplyr)
SYM <- c("ABC", "LLL")

# Option 1
mydf %>%
  filter(symbol1 %in% SYM, symbol2 %in% SYM)

# Option 2
mydf %>%
  filter_at(vars(symbol1, symbol2), all_vars(. %in% SYM))
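
If you are on dplyr 1.0 or later (where filter_at() is superseded), an equivalent using if_all() might look like this minimal sketch:

# Option 3 (dplyr >= 1.0)
mydf %>%
  filter(if_all(c(symbol1, symbol2), ~ .x %in% SYM))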

Here's the starting data:

mydf <- structure(list(id1 = c(4790L, 2300L, 1120L, 1120L), id2 = c(1120L, 
    4790L, 4790L, 3120L), symbol1 = c("ABC", "NNN", "LLL", "LLL"), 
        symbol2 = c("LLL", "ABC", "ABC", "CCC"), scoreA = c(1L, 0L, 
        0L, 0L), scoreB = c(0L, 0L, 1L, 0L), scoreC = c(1L, 1L, 1L, 
        0L)), index = structure(integer(0), "`__symbol1`" = c(1L, 
    3L, 4L, 2L), "`__symbol2`" = c(2L, 3L, 4L, 1L), "`__symbol2__symbol1`" = c(3L, 
    2L, 4L, 1L)), row.names = c(NA, 4L), class = "data.frame")

As I hinted at in the comments, "inelegance" can be pretty subjective. I decided to benchmark your approach for efficiency, along with my suggestions here and my original suggestion of using paste. I also added an option with interaction, which can sometimes be faster than paste.

The sample data

library(data.table)
library(dplyr)
x <- rbindlist(replicate(250, mydf, FALSE)) ## 1000 rows
y <- rbindlist(replicate(1000, x, FALSE))   ## 1 million rows

The functions being tested

# OP's approach
op_fun <- function(data = x) {
  c1_step1 = data[symbol1 == "ABC",]
  c2_step1 = data[symbol1 == "LLL",]
  c1_step2 = c1_step1[symbol2 == "LLL",]
  c2_step2 = c2_step1[symbol2 == "ABC",]
  rbind(c1_step2, c2_step2)
}

# data.table
am_fun <- function(data = x, symbs = c("ABC", "LLL")) {
  data[symbol1 %in% symbs & symbol2 %in% symbs]
}

# dplyr
am_dplyr_1 <- function(data = x, symbs = c("ABC", "LLL")) {
  data %>% filter(symbol1 %in% symbs, symbol2 %in% symbs)
}
am_dplyr_2 <- function(data = x, symbs = c("ABC", "LLL")) {
  data %>% filter_at(vars(symbol1, symbol2), all_vars(. %in% symbs))
}

# base R
paste_fun <- function(data = x) {
  as.data.table(data)[paste(symbol1, symbol2) %in% c("ABC LLL", "LLL ABC")]
}
interaction_fun <- function(data = x) {
  as.data.table(data)[interaction(symbol1, symbol2) %in% c("ABC.LLL", "LLL.ABC")]
}

The benchmarks

bench::mark(op_fun(x), am_fun(x), am_dplyr_1(x), am_dplyr_2(x), paste_fun(x), interaction_fun(x), 
            check = FALSE, time_unit = "ms")
# # A tibble: 6 x 13
#   expression           min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result           memory          time    gc              
#   <bch:expr>         <dbl>  <dbl>     <dbl> <bch:byt>    <dbl> <int> <dbl>      <dbl> <list>           <list>          <list>  <list>          
# 1 op_fun(x)          4.16   4.28       230.   523.1KB     6.57   105     3       456. <df[,7] [500 × … <df[,3] [112 ×… <bch:t… <tibble [108 × …
# 2 am_fun(x)          1.29   1.36       701.   150.9KB     4.21   333     2       475. <df[,7] [500 × … <df[,3] [31 × … <bch:t… <tibble [335 × …
# 3 am_dplyr_1(x)      0.602  0.627     1565.    62.7KB     8.73   717     4       458. <df[,7] [500 × … <df[,3] [22 × … <bch:t… <tibble [721 × …
# 4 am_dplyr_2(x)      1.45   1.52       645.    66.6KB     6.56   295     3       457. <df[,7] [500 × … <df[,3] [28 × … <bch:t… <tibble [298 × …
# 5 paste_fun(x)       0.403  0.414     2374.   155.6KB     4.20  1130     2       476. <df[,7] [500 × … <df[,3] [33 × … <bch:t… <tibble [1,132 …
# 6 interaction_fun(x) 0.483  0.496     1960.   219.1KB     6.38   922     3       470. <df[,7] [500 × … <df[,3] [46 × … <bch:t… <tibble [925 × …

bench::mark(op_fun(y), am_fun(y), am_dplyr_1(y), am_dplyr_2(y), paste_fun(y), interaction_fun(y), 
            check = FALSE, time_unit = "ms")
# # A tibble: 6 x 13
#   expression           min median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result             memory          time     gc           
#   <bch:expr>         <dbl>  <dbl>     <dbl> <bch:byt>    <dbl> <int> <dbl>      <dbl> <list>             <list>          <list>   <list>       
# 1 op_fun(y)           67.2   71.9      8.65    98.6MB    10.4      5     6       578. <df[,7] [500,000 … <df[,3] [112 ×… <bch:tm> <tibble [5 ×…
# 2 am_fun(y)           19.8   25.0     39.7     36.4MB    19.8     20    10       504. <df[,7] [500,000 … <df[,3] [31 × … <bch:tm> <tibble [20 …
# 3 am_dplyr_1(y)       33.4   41.4     19.3     59.1MB    13.5     10     7       518. <df[,7] [500,000 … <df[,3] [22 × … <bch:tm> <tibble [10 …
# 4 am_dplyr_2(y)       34.5   43.8     23.3     59.1MB    19.5     12    10       514. <df[,7] [500,000 … <df[,3] [28 × … <bch:tm> <tibble [12 …
# 5 paste_fun(y)       181.   196.       4.38     103MB     5.84     3     4       685. <df[,7] [500,000 … <df[,3] [34 × … <bch:tm> <tibble [3 ×…
# 6 interaction_fun(y) 108.   168.       5.88   164.8MB    10.3      4     7       681. <df[,7] [500,000 … <df[,3] [53 × … <bch:tm> <tibble [4 ×…

As you can see, with smaller data (1000 rows), paste and interaction are fast. interaction seems to scale better than paste, but it still doesn't compare with your approach or the approach using %in% that I've suggested here. The data.table and dplyr approaches scale the best and remain very readable.
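
Since the question mentions that the pair of symbols changes dynamically, the %in% filter can also be wrapped in a small helper that takes the two symbols as arguments. This is just a sketch (the name find_pair is a placeholder, and it assumes data.table is loaded as above):

# hypothetical wrapper: keep rows whose symbol columns hold the given pair,
# in either order
find_pair <- function(data, sym_a, sym_b) {
  symbs <- c(sym_a, sym_b)
  as.data.table(data)[symbol1 %in% symbs & symbol2 %in% symbs]
}

find_pair(mydf, "ABC", "LLL")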

NOTE: I haven't tested the other answers as I don't think they are quite correct.

Upvotes: 2

manotheshark

Reputation: 4367

This might need a little improvement if the data set is large and the function becomes slow. It generates all combinations of symbols and checks whether at least two rows contain a matching pair; all matched pairs are returned as separate data.frames in a list.

matchedSymbols <- function(data) {
  # all unique symbols in the data set
  allSymbols <- unique(c(data$symbol1, data$symbol2))
  # every pairwise combination of those symbols
  allCombinations <- combn(allSymbols, 2)
  # flag rows containing each combination in either order
  symbolFlags <- apply(allCombinations, 2, function(x)
    data$symbol1 == x[1] & data$symbol2 == x[2] |
      data$symbol1 == x[2] & data$symbol2 == x[1])
  # keep combinations matched by at least two rows
  # (drop = FALSE so a single matched pair still returns a matrix)
  pairFlags <- symbolFlags[, colSums(symbolFlags) > 1, drop = FALSE]
  # return a list with one data.frame per matched pair
  lapply(seq_len(ncol(pairFlags)), function(x) data[pairFlags[, x], ])
}

Pass the entire data.frame to the function

matchedSymbols(scores_file)

Added dummy data to insert more pairs for testing

df1 <- read.table(
  text="id1 id2 symbol1 symbol2 scoreA scoreB scoreC
  4790 1120 ABC LLL 1 0 1
  2300 4790 NNN ABC 0 0 1
  1120 4790 LLL ABC 0 1 1
  1120 3120 LLL CCC 0 0 0
  1120 3120 XYZ ZYX 0 0 0
  1120 3120 ZYX XYZ 0 0 0",
  header = TRUE,
  stringsAsFactors = FALSE)

> matchedSymbols(df1)
[[1]]
   id1  id2 symbol1 symbol2 scoreA scoreB scoreC
1 4790 1120     ABC     LLL      1      0      1
3 1120 4790     LLL     ABC      0      1      1

[[2]]
   id1  id2 symbol1 symbol2 scoreA scoreB scoreC
5 1120 3120     XYZ     ZYX      0      0      0
6 1120 3120     ZYX     XYZ      0      0      0

Upvotes: 1
