rvest scraping html content values

Question

I'm trying to scrape the page at the URL in the code below to create a data frame with the columns position, company, and meta1 to meta5. Unfortunately, I don't know how to capture the values stored in the content attributes, so that, for example, the value Tauragė ends up in my data frame.

My initial code:

if(!require("tidyverse")) install.packages("tidyverse"); library("tidyverse")
if(!require("rvest")) install.packages("rvest"); library("rvest")

# setting url and reading html code 
url <- "https://www.cv.lt/employee/announcementsAll.do?regular=true&salaryInterval=-1&interval=2&ipp=1000"
html <- read_html(url, encoding = "utf-8")

# creating a dataframe of ads
ads <- html %>%{
  data.frame(
    position=html_nodes(html, "tbody p a:nth-child(1)") %>% html_text(),
    company=html_nodes(html, "tbody p a:nth-child(2)")%>% html_text(),
    meta1=...
    meta2=...
    meta3=...
    meta4=...
    meta5=... 
)}

An example of the HTML code:


    
        VšĮ Tauragės rajono pirminės sveikatos priežiūros centro direktorius
        Viešoji įstaiga Tauragės rajono pirminės sveikatos priežiūros centras

maydin · Accepted Answer

You can read the content attribute of every meta tag like this:

my_content <- html %>% html_nodes("tbody p meta")  %>%  html_attr("content")
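A minimal, self-contained sketch of what that call does, run against inline markup instead of the live page. The sample HTML is an assumption reconstructed from the selectors used here (meta tags carrying content attributes, inside a p, inside a tbody); the real page may be structured differently.

```r
library(rvest)

# Hypothetical markup: one ad with two of its meta tags
# (structure assumed from the selectors, not copied from the site)
sample_page <- minimal_html('
  <table><tbody><tr><td>
    <p>
      <a href="#">Position</a>
      <meta itemprop="jobLocation" content="Tauragė">
      <meta itemprop="employmentType" content="FULL_TIME">
    </p>
  </td></tr></tbody></table>')

# <meta> elements have no text, so html_text() would return empty strings;
# the value lives in the content attribute, so read it with html_attr()
sample_content <- sample_page %>%
  html_nodes("tbody p meta") %>%
  html_attr("content")
```

The result is a plain character vector with one element per meta tag, in document order.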

This relies on every ad having exactly five meta tags, always in the same order. Tag the values with a repeating index from 1 to 5, then split them into meta1, meta2, ..., meta5:

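# Repeating index 1..5, sized to the data rather than a hard-coded ad count
index <- rep(1:5, length.out = length(my_content))
meta <- data.frame(Meta = my_content, Index = index)

# Rows belonging to each of the five meta fields
meta1 <- meta[meta$Index == 1, ]
meta2 <- meta[meta$Index == 2, ]
meta3 <- meta[meta$Index == 3, ]
meta4 <- meta[meta$Index == 4, ]
meta5 <- meta[meta$Index == 5, ]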
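If the five-per-ad assumption holds, the same split can be done in one step by reshaping the vector into a five-column matrix. This is only a sketch: the my_content values below are made-up stand-ins, and meta_wide is a hypothetical name.

```r
# Hypothetical stand-in for my_content: two ads, five values each, in page order
my_content <- c(
  "Taurage", "2019-01-02", "FULL_TIME", "2019-02-01", "https://example.com/ad/1",
  "Vilnius", "2019-01-03", "PART_TIME", "2019-02-02", "https://example.com/ad/2"
)

# Fail early if the page did not yield a whole number of five-value groups
stopifnot(length(my_content) %% 5 == 0)

# Fill row by row so that each row holds the five values of one ad
meta_wide <- as.data.frame(
  matrix(my_content, ncol = 5, byrow = TRUE),
  stringsAsFactors = FALSE
)
names(meta_wide) <- paste0("meta", 1:5)
```

Each column of meta_wide is then a plain vector that can be used directly as a data frame column.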

EDIT:

Another approach is to use the itemprop values inside html_nodes():

html %>% html_nodes("[itemprop='jobLocation']") %>% html_attr("content")

This returns only the meta1 values. Using the itemprop value of each field, you can extract all five:

meta1 <- html %>% html_nodes("[itemprop='jobLocation']") %>% html_attr("content")
meta2 <- html %>% html_nodes("[itemprop='datePosted']") %>% html_attr("content")
meta3 <- html %>% html_nodes("[itemprop='employmentType']") %>% html_attr("content")
meta4 <- html %>% html_nodes("[itemprop='validThrough']") %>% html_attr("content")
meta5 <- html %>% html_nodes("[itemprop='url']") %>% html_attr("content")
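These vectors only line up row by row if every ad has every field. One way to guard against gaps is to select the ad nodes first and then call html_node() (singular) on them, which returns one result per ad and NA where a field is missing. The sketch below uses inline sample markup; the layout (one p per ad, holding two links followed by the meta tags) is an assumption, and get_meta is a hypothetical helper.

```r
library(rvest)

# Hypothetical markup: two ads, the second one lacking employmentType
page <- minimal_html('
  <table><tbody>
    <tr><td><p>
      <a href="#">Director</a> <a href="#">Health Centre</a>
      <meta itemprop="jobLocation" content="Taurage">
      <meta itemprop="employmentType" content="FULL_TIME">
    </p></td></tr>
    <tr><td><p>
      <a href="#">Analyst</a> <a href="#">Some Company</a>
      <meta itemprop="jobLocation" content="Vilnius">
    </p></td></tr>
  </tbody></table>')

# One node per ad (assumed to be one <p> per ad)
ad_nodes <- html_nodes(page, "tbody p")

# Hypothetical helper: content of the first tag with this itemprop in each ad
get_meta <- function(nodes, prop) {
  nodes %>%
    html_node(paste0("[itemprop='", prop, "']")) %>%
    html_attr("content")
}

ads <- data.frame(
  position = ad_nodes %>% html_node("a") %>% html_text(),      # first link in the ad
  company  = ad_nodes %>% html_node("a + a") %>% html_text(),  # link right after it
  meta1    = get_meta(ad_nodes, "jobLocation"),
  meta3    = get_meta(ad_nodes, "employmentType"),
  stringsAsFactors = FALSE
)
```

The remaining fields (datePosted, validThrough, url) can be added the same way. Because every column is derived from the same ad_nodes, all columns have one entry per ad and data.frame() cannot fail on mismatched lengths.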
