Reputation: 25

Performing logical tests against multiple rows in R

I'm fairly new in R and struggling to get this. The type of problem I'm trying to address involves one data frame containing books and the start and end page of a particular chapter.

book <- c("Dune", "LOTR", "LOTR", "OriginOfSpecies", "OldManSea")
chapt.start <- c(300, 8, 94, 150, 600)
chapt.end <- c(310, 19, 110, 158, 630)
df1 <- data.frame(books, chapt.start, chapt.end)
df1

books chapt.start chapt.end
1            Dune         300       310
2            LOTR           8        19
3            LOTR          94       110
4 OriginOfSpecies         150       158
5       OldManSea         600       630

My second dataframe contains a list of book titles and a single page.

title <- c("LOTR", "LOTR", "LOTR", "OriginOfSpecies", "OldManSea", "OldManSea")
page <- c(4, 12, 30, 200, 620, 650)
df2 <- data.frame(title, page)
df2

title page
1            LOTR    4
2            LOTR   12
3            LOTR   30
4 OriginOfSpecies  200
5       OldManSea  620
6       OldManSea  650

What I'm trying to ask is for each row in df1 is whether df2 contains any rows with the corresponding book title and the page is within the chapter, i.e. df2$title==df1$book and df2$page>df1$chapt.start and df2$page < df1$chapt.end

The desired output for these data would be FALSE, TRUE, FALSE, FALSE, TRUE

Is this best approached as some kind of for, ifelse loop, sapply, or something different? Thanks for your help people!

Upvotes: 1

Answers (4)

AnilGoyal

Reputation: 26218

using dplyr only i.e. without purrr or fuzzyjoin

df2 %>% right_join(df1 %>% mutate(id = row_number()), by = c("title" = "book")) %>%
  group_by(id, title) %>%
  summarise(desired = ifelse(is.na(as.logical(sum(chapt.start <= page & page <= chapt.end))), 
                             F, 
                             as.logical(sum(chapt.start <= page & page <= chapt.end))))

# A tibble: 5 x 3
# Groups:   id [5]
     id title           desired
  <int> <chr>           <lgl>  
1     1 Dune            FALSE  
2     2 LOTR            TRUE   
3     3 LOTR            FALSE  
4     4 OriginOfSpecies FALSE  
5     5 OldManSea       TRUE

Upvotes: 1

Sinh Nguyen

Reputation: 4487

Another approach using purrr without joining data

Create a logical check variable for df1

library(dplyr)
library(purrr)

# This function design to take ... which is a row of data from pmap
# And then look up if there is any record match condition define in df2
look_up_check_df1 <- function(..., page_df) {
  book_record <- tibble(...)
  any_record <- page_df %>%
    filter(title == book_record[["book"]],
      page >= book_record[["chapt.start"]],
      page <= book_record[["chapt.end"]])
  
  nrow(any_record) > 0
}
df1$check <- pmap_lgl(df1, look_up_check_df1, page_df = df2)
df1
#>              book chapt.start chapt.end check
#> 1            Dune         300       310 FALSE
#> 2            LOTR           8        19  TRUE
#> 3            LOTR          94       110 FALSE
#> 4 OriginOfSpecies         150       158 FALSE
#> 5       OldManSea         600       630  TRUE

Same logics just did it for df2

# If the check is for df2 then just need to revised it a bit
look_up_check <- function(..., book_chapters_df) {
  page_record <- tibble(...)
  any_record <- book_chapters_df %>%
    filter(book == page_record[["title"]],
      chapt.start <= page_record[["page"]],
      chapt.end >= page_record[["page"]])
  
  nrow(any_record) > 0
}

# Run a pmap_lgl which passing each row of df2 into function look_up_check
# and return a vector of logical TRUE/FALSE
df2$check <- pmap_lgl(df2, look_up_check, book_chapters_df = df1)

df2
#>             title page check
#> 1            LOTR    4 FALSE
#> 2            LOTR   12  TRUE
#> 3            LOTR   30 FALSE
#> 4 OriginOfSpecies  200 FALSE
#> 5       OldManSea  620  TRUE
#> 6       OldManSea  650 FALSE

^{Created on 2021-04-12 by the reprex package (v1.0.0)}

Upvotes: 0

r2evans

Reputation: 160447

This is a range-based join. There are three good ways to do this in R. All of these are returning the page number itself instead of true/false, it should be straight-forward to convert to logical with something like !is.na(page).

sqldf

library(sqldf)
sqldf(
  "select df1.*, df2.page
   from df1
     left join df2 on df1.book=df2.title
       and df2.page between df1.[chapt.start] and df1.[chapt.end]")
#              book chapt.start chapt.end page
# 1            Dune         300       310   NA
# 2            LOTR           8        19   12
# 3            LOTR          94       110   NA
# 4 OriginOfSpecies         150       158   NA
# 5       OldManSea         600       630  620

fuzzyjoin

(Edited out, see @IanCampbell's answer.)

data.table

library(data.table)
DT1 <- as.data.table(df1)
DT2 <- as.data.table(df2)
DT2[, p2 := page][DT1, on = .(title == book, p2 >= chapt.start, p2 <= chapt.end)]
#              title  page    p2  p2.1
#             <char> <num> <num> <num>
# 1:            Dune    NA   300   310
# 2:            LOTR    12     8    19
# 3:            LOTR    NA    94   110
# 4: OriginOfSpecies    NA   150   158
# 5:       OldManSea   620   600   630

The reason I add p2 as a copy of page is that data.table on range-joins replaces the left's (inequality) column with those from the right (or something like that), so we'd lose that bit of info.

Upvotes: 4

Ian Campbell

Reputation: 24790

You're looking for a non-equi join. This can be accomplished in many ways, but I prefer the fuzzyjoin package:

library(fuzzyjoin)
fuzzy_left_join(df1, df2, 
                by = c( "books" = "title" , "chapt.start" = "page", "chapt.end" = "page"),
                match_fun = c(`==`, `<=`, `>=`))
            books chapt.start chapt.end     title page
1            Dune         300       310      <NA>   NA
2            LOTR           8        19      LOTR   12
3            LOTR          94       110      <NA>   NA
4 OriginOfSpecies         150       158      <NA>   NA
5       OldManSea         600       630 OldManSea  620

From here it's easy to get to the desired output:

library(dplyr)
fuzzy_left_join(df1, df2, 
                by = c( "books" = "title" , "chapt.start" = "page", "chapt.end" = "page"),
                match_fun = c(`==`, `<=`, `>=`)) %>%
  mutate(result = !is.na(page)) %>%
  select(-c(title,page))
            books chapt.start chapt.end result
1            Dune         300       310  FALSE
2            LOTR           8        19   TRUE
3            LOTR          94       110  FALSE
4 OriginOfSpecies         150       158  FALSE
5       OldManSea         600       630   TRUE

Upvotes: 3