Reputation: 25
I'm fairly new in R and struggling to get this. The type of problem I'm trying to address involves one data frame containing books and the start and end page of a particular chapter.
book <- c("Dune", "LOTR", "LOTR", "OriginOfSpecies", "OldManSea")
chapt.start <- c(300, 8, 94, 150, 600)
chapt.end <- c(310, 19, 110, 158, 630)
df1 <- data.frame(books, chapt.start, chapt.end)
df1
books chapt.start chapt.end
1 Dune 300 310
2 LOTR 8 19
3 LOTR 94 110
4 OriginOfSpecies 150 158
5 OldManSea 600 630
My second dataframe contains a list of book titles and a single page.
title <- c("LOTR", "LOTR", "LOTR", "OriginOfSpecies", "OldManSea", "OldManSea")
page <- c(4, 12, 30, 200, 620, 650)
df2 <- data.frame(title, page)
df2
title page
1 LOTR 4
2 LOTR 12
3 LOTR 30
4 OriginOfSpecies 200
5 OldManSea 620
6 OldManSea 650
What I'm trying to ask is for each row in df1 is whether df2 contains any rows with the corresponding book title and the page is within the chapter, i.e. df2$title==df1$book and df2$page>df1$chapt.start and df2$page < df1$chapt.end
The desired output for these data would be FALSE, TRUE, FALSE, FALSE, TRUE
Is this best approached as some kind of for, ifelse loop, sapply, or something different? Thanks for your help people!
Upvotes: 1
Views: 199
Reputation: 26218
using dplyr
only i.e. without purrr
or fuzzyjoin
df2 %>% right_join(df1 %>% mutate(id = row_number()), by = c("title" = "book")) %>%
group_by(id, title) %>%
summarise(desired = ifelse(is.na(as.logical(sum(chapt.start <= page & page <= chapt.end))),
F,
as.logical(sum(chapt.start <= page & page <= chapt.end))))
# A tibble: 5 x 3
# Groups: id [5]
id title desired
<int> <chr> <lgl>
1 1 Dune FALSE
2 2 LOTR TRUE
3 3 LOTR FALSE
4 4 OriginOfSpecies FALSE
5 5 OldManSea TRUE
Upvotes: 1
Reputation: 4487
Another approach using purrr
without joining data
library(dplyr)
library(purrr)
# This function design to take ... which is a row of data from pmap
# And then look up if there is any record match condition define in df2
look_up_check_df1 <- function(..., page_df) {
book_record <- tibble(...)
any_record <- page_df %>%
filter(title == book_record[["book"]],
page >= book_record[["chapt.start"]],
page <= book_record[["chapt.end"]])
nrow(any_record) > 0
}
df1$check <- pmap_lgl(df1, look_up_check_df1, page_df = df2)
df1
#> book chapt.start chapt.end check
#> 1 Dune 300 310 FALSE
#> 2 LOTR 8 19 TRUE
#> 3 LOTR 94 110 FALSE
#> 4 OriginOfSpecies 150 158 FALSE
#> 5 OldManSea 600 630 TRUE
# If the check is for df2 then just need to revised it a bit
look_up_check <- function(..., book_chapters_df) {
page_record <- tibble(...)
any_record <- book_chapters_df %>%
filter(book == page_record[["title"]],
chapt.start <= page_record[["page"]],
chapt.end >= page_record[["page"]])
nrow(any_record) > 0
}
# Run a pmap_lgl which passing each row of df2 into function look_up_check
# and return a vector of logical TRUE/FALSE
df2$check <- pmap_lgl(df2, look_up_check, book_chapters_df = df1)
df2
#> title page check
#> 1 LOTR 4 FALSE
#> 2 LOTR 12 TRUE
#> 3 LOTR 30 FALSE
#> 4 OriginOfSpecies 200 FALSE
#> 5 OldManSea 620 TRUE
#> 6 OldManSea 650 FALSE
Created on 2021-04-12 by the reprex package (v1.0.0)
Upvotes: 0
Reputation: 160447
This is a range-based join. There are three good ways to do this in R. All of these are returning the page number itself instead of true/false, it should be straight-forward to convert to logical with something like !is.na(page)
.
library(sqldf)
sqldf(
"select df1.*, df2.page
from df1
left join df2 on df1.book=df2.title
and df2.page between df1.[chapt.start] and df1.[chapt.end]")
# book chapt.start chapt.end page
# 1 Dune 300 310 NA
# 2 LOTR 8 19 12
# 3 LOTR 94 110 NA
# 4 OriginOfSpecies 150 158 NA
# 5 OldManSea 600 630 620
(Edited out, see @IanCampbell's answer.)
library(data.table)
DT1 <- as.data.table(df1)
DT2 <- as.data.table(df2)
DT2[, p2 := page][DT1, on = .(title == book, p2 >= chapt.start, p2 <= chapt.end)]
# title page p2 p2.1
# <char> <num> <num> <num>
# 1: Dune NA 300 310
# 2: LOTR 12 8 19
# 3: LOTR NA 94 110
# 4: OriginOfSpecies NA 150 158
# 5: OldManSea 620 600 630
The reason I add p2
as a copy of page
is that data.table
on range-joins replaces the left's (inequality) column with those from the right (or something like that), so we'd lose that bit of info.
Upvotes: 4
Reputation: 24790
You're looking for a non-equi join. This can be accomplished in many ways, but I prefer the fuzzyjoin
package:
library(fuzzyjoin)
fuzzy_left_join(df1, df2,
by = c( "books" = "title" , "chapt.start" = "page", "chapt.end" = "page"),
match_fun = c(`==`, `<=`, `>=`))
books chapt.start chapt.end title page
1 Dune 300 310 <NA> NA
2 LOTR 8 19 LOTR 12
3 LOTR 94 110 <NA> NA
4 OriginOfSpecies 150 158 <NA> NA
5 OldManSea 600 630 OldManSea 620
From here it's easy to get to the desired output:
library(dplyr)
fuzzy_left_join(df1, df2,
by = c( "books" = "title" , "chapt.start" = "page", "chapt.end" = "page"),
match_fun = c(`==`, `<=`, `>=`)) %>%
mutate(result = !is.na(page)) %>%
select(-c(title,page))
books chapt.start chapt.end result
1 Dune 300 310 FALSE
2 LOTR 8 19 TRUE
3 LOTR 94 110 FALSE
4 OriginOfSpecies 150 158 FALSE
5 OldManSea 600 630 TRUE
Upvotes: 3