stats_noob

Reputation: 5897

Webscraping between two tags

I am trying to scrape the following page(s):

https://mywebsite.com

In particular, I would like to get the name of each entry. I noticed that the text I am interested in (MY TEXT) always sits between these two tags: <div class="title"> <a href="your text"> MY TEXT </a>

I know how to search for these tags individually:

#load libraries 
library(rvest)
library(httr)
library(XML)

# set up page
url<-"https://www.mywebsite.com"
page <-read_html(url)

#option 1
b = page %>% html_nodes("title")

option1 <- b %>% html_text() %>% strsplit("\\n")

#option 2
b = page %>% html_nodes("a")

option2 <- b %>% html_text() %>% strsplit("\\n")

Is there a way to specify the "html_nodes" argument so that it picks up "MY TEXT", i.e. scrapes the text between <div class="title"> and </a>:

 <div class="title"> <a href="your text"> MY TEXT </a>

Upvotes: 0

Views: 114

Answers (2)

HoelR

Reputation: 6563

To scrape pages 1 to 10:

library(tidyverse)
library(rvest)

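# Scrape one results page and return its entries as a tibble
my_function <- function(page_n) {
  
  cat("Scraping page ", page_n, "\n")
  
  # Build the URL for this page number and parse the HTML
  page <- paste0("https://www.dentistsearch.ca/search-doctor/",
    page_n, "?category=0&services=0&province=55&city=&k=") %>% read_html()
  
  # ".title a" selects every <a> inside an element with class "title"
  tibble(title = page %>%
           html_elements(".title a") %>%
           html_text2(),
         address = page %>%  
           html_elements(".marker") %>% 
           html_text2(),
         page = page_n)
}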

df <- map_dfr(1:10, my_function)
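
The selector can be checked offline before hitting the real site. The following is a minimal sketch, assuming markup shaped like the snippet in the question; the HTML string and the hrefs are made up for illustration, and only rvest's minimal_html(), html_elements() and html_text2() are used.

```r
library(rvest)

# Hypothetical markup modelled on the snippet in the question
html <- minimal_html('
  <div class="title"><a href="/entry/1">First entry</a></div>
  <div class="title"><a href="/entry/2">Second entry</a></div>
  <div class="other"><a href="/about">About</a></div>
')

# ".title a" matches any <a> nested inside an element with class "title",
# so the link inside <div class="other"> is skipped
titles <- html %>%
  html_elements(".title a") %>%
  html_text2()

print(titles)
```

Using a descendant selector like this avoids selecting every a tag on the page and filtering afterwards.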

Upvotes: 5

Allan Cameron

Reputation: 173793

You can use the xpath argument inside html_elements to locate each <a> tag that is a direct child of a <div> with class "title".

Here's a complete reproducible example.

library(rvest)

"https://www.mywebsite.ca/extension1/" %>%
  paste0("2?extension2") %>%
  read_html() %>%
  html_elements(xpath = "//div[@class='title']/a") %>% 
  html_text()

Or to get all entries on the first 10 pages:

library(rvest)

unlist(lapply(1:10, function(page) {
  "https://www.mywebsite.ca/extension1/" %>%
    paste0(page, "?extension2") %>%
    read_html() %>%
    html_elements(xpath = "//div[@class='title']/a") %>% 
    html_text()
}))

Created on 2022-07-26 by the reprex package (v2.0.1)
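
One caveat worth knowing: @class='title' compares the whole class attribute, so a div with several classes would not match. The sketch below uses made-up HTML (the second div and its "featured" class are assumptions, not taken from the real page) to contrast the exact match with a token-based XPath 1.0 test.

```r
library(rvest)

# Hypothetical markup: the second div carries an extra class
html <- minimal_html('
  <div class="title"><a href="/entry/1">First entry</a></div>
  <div class="title featured"><a href="/entry/2">Second entry</a></div>
')

# Exact comparison: only matches when the class attribute is exactly "title"
exact <- html %>%
  html_elements(xpath = "//div[@class='title']/a") %>%
  html_text()

# Token comparison: matches when "title" is one of the space-separated classes
token_xpath <- paste0(
  "//div[contains(concat(' ', normalize-space(@class), ' '), ' title ')]/a"
)
token <- html %>%
  html_elements(xpath = token_xpath) %>%
  html_text()

print(exact)
print(token)
```

If the real page only ever uses class="title" on its own, the simpler expression in the answer is enough.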

Upvotes: 2
