Webscraping between two tags

Question

I am trying to scrape the following page(s):

https://mywebsite.com

In particular, I would like to get the name of each entry. I noticed that the text I am interested in (MY TEXT) is always in the middle of these two tags:

MY TEXT

I know how to search for these tags individually:

# load libraries
library(rvest)
library(httr)
library(XML)

# set up page
url <- "https://www.mywebsite.com"
page <- read_html(url)

# option 1: select every <title> element
b <- page %>% html_nodes("title")

option1 <- b %>% html_text() %>% strsplit("\n")

# option 2: select every <a> element
b <- page %>% html_nodes("a")

option2 <- b %>% html_text() %>% strsplit("\n")

Is there some way to specify the "html_nodes" argument so that it picks up "MY TEXT", i.e. scrapes only the text between those two tags?

HoelR · Accepted Answer

The following scrapes pages 1 to 10:

library(tidyverse)
library(rvest)

my_function <- function(page_n) {
  
  cat("Scraping page ", page_n, "\n")
  
  page <- paste0("https://www.dentistsearch.ca/search-doctor/",
    page_n, "?category=0&services=0&province=55&city=&k=") %>% read_html
  
  tibble(title = page %>%
           html_elements(".title a") %>%
           html_text2(),
         address = page %>%
           html_elements(".marker") %>% 
           html_text2(),
         page = page_n)
}

df <- map_dfr(1:10, my_function)
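
To see why a selector such as ".title a" answers the "between two tags" part of the question, here is a minimal sketch that runs on inline HTML instead of the live site. The markup, names and addresses below are made up for illustration (the real page's structure may differ); minimal_html() is rvest's helper for building a small document from a string.

```r
library(rvest)

# Hypothetical markup standing in for the real page (an assumption):
# each name is the text of an <a> nested inside an element with class "title".
html <- minimal_html('
  <div class="title"><a href="/doctor/1">Dr. Example One</a></div>
  <div class="marker">1 Example Street</div>
  <div class="title"><a href="/doctor/2">Dr. Example Two</a></div>
  <div class="marker">2 Example Avenue</div>
')

# ".title a" matches every <a> that sits inside an element with class "title";
# html_text2() then returns only the text between the opening and closing tag.
titles <- html %>%
  html_elements(".title a") %>%
  html_text2()

# ".marker" matches every element with class "marker".
markers <- html %>%
  html_elements(".marker") %>%
  html_text2()

print(titles)
print(markers)
```

So rather than searching for "title" and "a" separately, the two are combined into one CSS selector that describes where the text lives. The same idea should carry over to the real page once the actual class names are confirmed in the browser's inspector.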
