Reputation: 651
I'm trying to scrape a website that has numerous different information I want in paragraphs. I got this to work perfect... However, I don't understand how to break the text up and create a dataframe.
Website :Website I want Scraped
Code:
library(rvest)
url <- "https://www.state.nj.us/treasury/administration/statewide-support/motor-fuel-locations.shtml"
#Reading the HTML code from the website
webpage <- read_html(url)
p_nodes<-webpage%>%
html_nodes(xpath = '//p')%>%
html_text()
#replace multiple whitespaces with single space
p_nodes<- gsub('\\s+',' ',p_nodes)
#trim spaces from ends of elements
p_nodes <- trimws(p_nodes)
#drop blank elements
p_nodes <- p_nodes[p_nodes != '']
How I want the dataframe to look:
I'm not sure if this is even possible. I tried to extract each piece of information separately and then make the dataframe like that but it doesn't work since most of the info is stored in the p tag. I would appreciate any guidance. Thanks!
Upvotes: 0
Views: 830
Reputation: 2753
Proof-of-concept (based on what I wrote in the comment):
Code
lapply(c('data.table', 'httr', 'rvest'), library, character.only = T)
tags <- 'tr:nth-child(6) td , tr~ tr+ tr p , td+ p'
burl <- 'https://www.state.nj.us/treasury/administration/statewide-support/motor-fuel-locations.shtml'
url_text <- read_html(burl)
chunks <- url_text %>% html_nodes(tags) %>% html_text()
coordFunc <- function(chunk){
patter_lat <- 'Longitude:.*(-[[:digit:]]{1,2}.[[:digit:]]{0,15})'
ret <- regmatches(x = chunk, m = regexec(pattern = patter_lat, text = chunk))
return(ret[[1]][2])
}
longitudes <- as.numeric(unlist(lapply(chunks, coordFunc)))
Output
# using 'cat' to make the output easier to read
> cat(chunks[14])
Mt. Laurel DOT
Rt. 38, East
1/4 mile East of Rt. 295
Mt. Laurel Open 24 Hrs
Unleaded / Diesel
856-235-3096Latitude: 39.96744662Longitude: -74.88930386
> longitudes[14]
[1] -74.8893
If you do not coerce longitudes
to be numeric
, you get:
longitudes <- (unlist(lapply(chunks, coordFunc)))
> longitudes[14]
[1] "-74.88930386"
I chose the longitude as a proof-of-concept but you can modify your function to extract all relevant bits in a single call. For getting the right tag
you can use SelectorGadget extension (works well in Chrome for me). Alliteratively most browsers let you 'inspect element' to get the html
tag. The function could return the extracted values in a data.table
which can then be combined into one using rbindlist
.
You could even advance pages programatically to scrape the entire website - be sure to check with the usage policy (it's generally frowned upon or restricted to scrape websites).
Edit
the text is not structured the same way throughout the webpage so you'll need to spend more time examining what exceptions can take place.
Here's a new function to resolve each chunk into separate lines and then you can try to use additional regular expressions to get what you want.
newfunc <- function(chunk){
# Each chunk is a couple of lines. First, we split at '\r\n' using strsplit
# the output is a list so we use 'unlist' to get a vector
# then use 'trimws' to remove whitespace around it - try out each of these functions
# separately to understand what is going on. The final output here is a vector.
txt <- trimws(unlist(strsplit(chunk, '\r\n')))
return(txt)
}
This returns the 'text' contained in each chunk as a vector of separate lines. Taking a look at the number of lines in the first 20 chunks, you can see it is not the same:
> unlist(lapply(chunks[1:20], function(z) length(newfunc(z))))
[1] 5 6 5 7 5 5 5 5 5 4 1 6 6 6 5 1 1 1 5 6
A good way to resolve this would be to put in a conditional statement based on the number of lines of text in each chunk, e.g. in newfunc
you could add:
if(length(txt) == 1){
return(NULL)
}
This is because that is for the entries that don't have any text in them. since this a proof of concept I haven't checked all entries but there's some simple logic:
unleaded
or diesel
. You can grep
on these two strings to see what each depot offers. e.g. grepl('diesel', newfunc(chunks[12]))
html
tags e.g. all coorindates and opening hours are in boldface and have the tag strong
. You can extract those separately and then use regular expressions to get what you want. 24(Hrs|Hours)
to first extract all sites that are open 24 hours and then use selective regex
on the remainder to get their operating times. There is no simple easy answer with most web-scraping, you have to find patterns and then apply some logic based on that. Only on the most structured websites will you find something that works for the entire page/range.
Upvotes: 1
Reputation: 667
You can use tidyverse package (stringr, tibble, purrr)
library(rvest)
library(tidyverse)
url <- "https://www.state.nj.us/treasury/administration/statewide-support/motor-fuel-locations.shtml"
#Reading the HTML code from the website
webpage <- read_html(url)
p_nodes<-webpage%>%
html_nodes(xpath = '//p')%>%
html_text()
# Split on new line
l = p_nodes %>% stringr::str_split(pattern = "\r\n")
var1 = sapply(l, `[`, 1) # replace var by the name you want
var2 = sapply(l, `[`, 2)
var3 = sapply(l, `[`, 3)
var4 = sapply(l, `[`, 4)
var5 = sapply(l, `[`, 5)
t = tibble(var1,var2,var3,var4,var5) # make tibble
t = t %>% filter(!is.na(var2)) # delete useless lines
purrr::map_dfr(t,trimws) # delete blanks
Upvotes: 0