Need to extract text that doesn't have a clear XPath with rvest in R

Question

I have a few webpages that I want to scrape (HTML example below). In my example, I want to get the company name, location, salary, and posted date. My approach to getting the company name is:

library(xml2)
library(rvest)
library(tidyverse)

url <- "https://joblist.ala.org/job/library-director/53812381/"
page <- xml2::read_html(url)

company_name <- page %>% 
  rvest::html_nodes("li") %>%
  rvest::html_nodes(xpath = '//*[@class="clearfix"]') %>%
  #rvest::html_nodes("div")%>%
  rvest::html_nodes("span") %>%
  #rvest::html_name()%>%
  rvest::html_text()%>%
  stringr::str_replace_all("[\r\n\t]", "") %>%
  stringr::str_trim()

However this yields:

# [1] "Description"                                                    
# [2] "We are looking for a Skilled, Dynamic, and Collaborative Leader"
# [3] "Mobile Public Library"                                          
# [4] ""                                                               
# [5] "Mobile, Alabama, United States"                                 
# [6] "53812381"                                                       
# [7] "April 21, 2020"                                                 
# [8] "Library Director"                                               
# [9] "Mobile Public Library"                                          
# [10] "Public Library"                                                 
# [11] "Administration/Management"                                      
# [12] "No"                                                             
# [13] "Full-Time"                                                      
# [14] "Indefinite"                                                     
# [15] "Master's Degree"                                                
# [16] "5-7 Years"                                                      
# [17] "0-10%"                                                          
# [18] "Jobs You May Like"

I thought I could get what I want through indexing, but when I move to the next page, the positions of some elements change. For example:

url <- "https://joblist.ala.org/job/ceo-library-director-orange-county-library-system/53673222/"
page <- xml2::read_html(url)

company_name <- page %>% 
  rvest::html_nodes("li") %>%
  rvest::html_nodes(xpath = '//*[@class="clearfix"]') %>%
  #rvest::html_nodes("div")%>%
  rvest::html_nodes("span") %>%
  #rvest::html_name()%>%
  rvest::html_text()%>%
  stringr::str_replace_all("[\r\n\t]", "") %>%
  stringr::str_trim()

Yields:

# [1] "Description"                                           
# [2] "Requirements"                                          
# [3] "Orange County Library System"                          
# [4] ""                                                      
# [5] "Orlando, Florida, 32801, United States"                
# [6] "53673222"                                              
# [7] "April 1, 2020"                                         
# [8] "CEO / Library Director -  Orange County Library System"
# [9] "Orange County Library System"                          
# [10] "Public Library"                                        
# [11] "Administration/Management"                             
# [12] "No"                                                    
# [13] "Full-time"                                             
# [14] "Indefinite"                                            
# [15] "Master's Degree"                                       
# [16] "Over 10 Years"                                         
# [17] "10-25%"                                                
# [18] "$151,882.00 - $160,000.00 (Yearly Salary)"             
# [19] "Jobs You May Like"

Console Inspector looks like this:


  
  Location: Orlando, Florida, 32801, United States
  Job ID: 53673222
  Posted: April 1, 2020
  Position Title: CEO / Library Director -  Orange County Library System
  Company Name: Orange County Library System
  Library or Company Type: Public Library
  Job Category: Administration/Management
  Entry Level: No
  Job Type: Full-time
  Job Duration: Indefinite
  Min Education: Master's Degree
  Min Experience: Over 10 Years
  Required Travel: 10-25%
  Salary: $151,882.00 - $160,000.00 (Yearly Salary)

I was wondering if someone could help me by showing how to get the company name; I can then replicate it for the other fields. I'm not good with HTML. Thank you!

Ronak Shah · Accepted Answer

Since there are no specific classes for each category, we may use regex to extract the information.

library(rvest)

url <- "https://joblist.ala.org/job/library-director/53812381/"
page <- xml2::read_html(url)

page %>% 
  html_nodes("li") %>%
  html_nodes(xpath = '//*[@class="clearfix"]') %>%
  html_text() %>%
  gsub('[\r\n\t]', '', .) %>%
  grep('Company Name:', ., value = TRUE) %>%
  sub('Company Name:', '', .) %>% .[2]

#[1] " Mobile Public Library"

You can extract the information from other categories in the same way. For example, with 'Position Title:' :

page %>% 
  html_nodes("li") %>%
  html_nodes(xpath = '//*[@class="clearfix"]') %>%
  html_text() %>%
  gsub('[\r\n\t]', '', .) %>%
  grep('Position Title:', ., value = TRUE) %>%
  sub('Position Title:', '', .) %>% .[2]

#[1] " Library Director"

Probably, you could just write a function and pass strings like "Company Name:" and "Position Title:" to it.
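
A minimal sketch of such a function, using base R only. The name extract_field and the sample texts are hypothetical and for illustration only. Instead of picking the second match with .[2], this sketch takes the shortest matching entry, on the assumption that it is the innermost element holding just that label and its value; that assumption may not hold on every page.

```r
# Hypothetical helper (not part of rvest): return the value that follows
# `label` in a character vector of node texts, or NA if the label is absent.
extract_field <- function(texts, label) {
  # Remove carriage returns, newlines and tabs, as in the gsub() step above
  cleaned <- gsub("[\r\n\t]", "", texts)
  # Keep only the entries that contain the label (literal match, no regex)
  hits <- grep(label, cleaned, value = TRUE, fixed = TRUE)
  if (length(hits) == 0) {
    return(NA_character_)  # field not present on this page, e.g. no salary
  }
  # Assumption: the shortest hit is the innermost element for this field
  hit <- hits[which.min(nchar(hits))]
  # Keep what comes after the label and trim surrounding whitespace
  start <- as.integer(regexpr(label, hit, fixed = TRUE)) + nchar(label)
  trimws(substring(hit, start))
}

# Illustrative input, shaped like the html_text() output of nested nodes
sample_texts <- c(
  "Job ID:\n\t53673222 Posted:\n\tApril 1, 2020 Company Name:\n\tOrange County Library System",
  "\n\tCompany Name:\n\tOrange County Library System\n",
  "\n\tPosted:\n\tApril 1, 2020\n"
)

extract_field(sample_texts, "Company Name:")
extract_field(sample_texts, "Salary:")

# With the page object from above (needs rvest and network access):
# texts <- page %>% html_nodes(xpath = '//*[@class="clearfix"]') %>% html_text()
# extract_field(texts, "Company Name:")
```

Returning NA for a missing label means a page without a salary can still be processed with the same calls as a page that has one.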
