NBE

Reputation: 651

Creating a dataframe from paragraph text scraped from website in R

I'm trying to scrape a website that stores the information I want in a number of paragraphs. I got the scraping itself to work perfectly; however, I don't understand how to break the text up and create a dataframe.

Website: https://www.state.nj.us/treasury/administration/statewide-support/motor-fuel-locations.shtml

Code:

library(rvest)
url <- "https://www.state.nj.us/treasury/administration/statewide-support/motor-fuel-locations.shtml"

#Reading the HTML code from the website
webpage <- read_html(url)


p_nodes<-webpage%>%
  html_nodes(xpath = '//p')%>%
  html_text()

#replace multiple whitespaces with single space
p_nodes<- gsub('\\s+',' ',p_nodes)
#trim spaces from ends of elements
p_nodes <- trimws(p_nodes)
#drop blank elements
p_nodes <- p_nodes[p_nodes != '']

How I want the dataframe to look:

(screenshot of the desired dataframe)

I'm not sure whether this is even possible. I tried extracting each piece of information separately and building the dataframe from that, but it doesn't work because most of the info is stored inside the same p tags. I would appreciate any guidance. Thanks!

Upvotes: 0

Views: 830

Answers (2)

Gautam

Reputation: 2753

Proof-of-concept (based on what I wrote in the comment):

Code

lapply(c('data.table', 'httr', 'rvest'), library, character.only = T)

tags <- 'tr:nth-child(6) td , tr~ tr+ tr p , td+ p'
burl <- 'https://www.state.nj.us/treasury/administration/statewide-support/motor-fuel-locations.shtml'

url_text <- read_html(burl)

chunks <- url_text %>% html_nodes(tags) %>% html_text()

coordFunc <- function(chunk){
  # capture the value that follows "Longitude:" in each chunk
  pattern_lon <- 'Longitude:.*(-[[:digit:]]{1,2}\\.[[:digit:]]{0,15})'
  ret <- regmatches(x = chunk, m = regexec(pattern = pattern_lon, text = chunk))
  return(ret[[1]][2])
}

longitudes <- as.numeric(unlist(lapply(chunks, coordFunc)))

Output

# using 'cat' to make the output easier to read 
> cat(chunks[14])
Mt.    Laurel DOT
                  Rt. 38, East
                  1/4 mile East of Rt. 295
                  Mt. Laurel Open 24 Hrs
                  Unleaded / Diesel
                  856-235-3096Latitude:  39.96744662Longitude: -74.88930386 


> longitudes[14]
[1] -74.8893

If you do not coerce longitudes to be numeric, you get:

longitudes <- (unlist(lapply(chunks, coordFunc)))
> longitudes[14]
[1] "-74.88930386"

I chose the longitude as a proof-of-concept, but you can modify your function to extract all the relevant bits in a single call. For finding the right tag you can use the SelectorGadget extension (it works well in Chrome for me). Alternatively, most browsers let you 'inspect element' to get the HTML tag. The function could return the extracted values in a data.table, which can then be combined into one using rbindlist.
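For instance, here is a minimal sketch of that idea; the column names and the latitude regex are my own additions rather than anything from the page, so adjust them as needed:

# Sketch: build one row per chunk, then stack the rows with rbindlist
chunkFunc <- function(chunk){
  first_line <- trimws(unlist(strsplit(chunk, '\r\n')))[1]
  lat <- regmatches(chunk, regexec('Latitude:\\s*([[:digit:].]+)', chunk))[[1]][2]
  lon <- regmatches(chunk, regexec('Longitude:\\s*(-?[[:digit:].]+)', chunk))[[1]][2]
  data.table(name      = first_line,        # assumed: the first line is the depot name
             latitude  = as.numeric(lat),
             longitude = as.numeric(lon))
}

depots <- rbindlist(lapply(chunks, chunkFunc))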

You could even advance pages programmatically to scrape the entire website - just be sure to check the site's usage policy first (scraping is generally frowned upon or outright restricted).

Edit

The text is not structured the same way throughout the webpage, so you'll need to spend more time examining what exceptions can occur.

Here's a new function that resolves each chunk into separate lines; you can then use additional regular expressions to get what you want.

newfunc <- function(chunk){
  # Each chunk is a couple of lines. First, we split at '\r\n' using strsplit
  # the output is a list so we use 'unlist' to get a vector 
  # then use 'trimws' to remove whitespace around it - try out each of these functions
  # separately to understand what is going on. The final output here is a vector. 
  txt <- trimws(unlist(strsplit(chunk, '\r\n'))) 
  return(txt)
}

This returns the 'text' contained in each chunk as a vector of separate lines. Taking a look at the number of lines in the first 20 chunks, you can see it is not the same:

> unlist(lapply(chunks[1:20], function(z) length(newfunc(z))))
 [1] 5 6 5 7 5 5 5 5 5 4 1 6 6 6 5 1 1 1 5 6

A good way to resolve this would be to put in a conditional statement based on the number of lines of text in each chunk, e.g. in newfunc you could add:

if(length(txt) == 1){
  return(NULL)
}

This handles the entries that don't contain any text. Since this is a proof of concept I haven't checked all the entries, but there is some simple logic you can apply:

  1. The first line is typically the name.
  2. The coordinates are in the last line.
  3. The fuel can be either unleaded or diesel; you can grep for these two strings to see what each depot offers, e.g. grepl('diesel', newfunc(chunks[12])).
  4. Another approach would be to use a different set of HTML tags, e.g. all coordinates and opening hours are in boldface and carry the tag strong. You can extract those separately and then use regular expressions to get what you want.
  5. You could search for 24(Hrs|Hours) to first extract all the sites that are open 24 hours, and then use a selective regex on the remainder to get their operating times.

There is no simple, easy answer with most web scraping: you have to find patterns and then apply some logic based on them. Only on the most structured websites will you find something that works for the entire page/range.
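For example, here is a rough sketch that stitches points 1, 2, 3 and 5 together into one data.table row per chunk; the column names are my own and I haven't validated this against every entry on the page:

parseChunk <- function(chunk){
  txt <- newfunc(chunk)
  if(length(txt) <= 1) return(NULL)    # skip the empty entries
  data.table(
    name     = txt[1],                                                    # point 1: first line is usually the name
    unleaded = any(grepl('unleaded', txt, ignore.case = TRUE)),           # point 3
    diesel   = any(grepl('diesel', txt, ignore.case = TRUE)),
    open_24h = any(grepl('24\\s*(Hrs|Hours)', txt, ignore.case = TRUE)),  # point 5
    coords   = txt[length(txt)]                                           # point 2: coordinates sit in the last line
  )
}

depots <- rbindlist(lapply(chunks, parseChunk))  # NULL entries are dropped automatically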

Upvotes: 1

Alexandre georges

Reputation: 667

You can use the tidyverse packages (stringr, tibble, purrr):

library(rvest)
library(tidyverse)
url <- "https://www.state.nj.us/treasury/administration/statewide-support/motor-fuel-locations.shtml"
#Reading the HTML code from the website
webpage <- read_html(url)
p_nodes<-webpage%>%
  html_nodes(xpath = '//p')%>%
  html_text()
# Split on new line
l = p_nodes %>% stringr::str_split(pattern = "\r\n")
var1 = sapply(l, `[`, 1) # replace var by the name you want
var2 = sapply(l, `[`, 2)
var3 = sapply(l, `[`, 3)
var4 = sapply(l, `[`, 4)
var5 = sapply(l, `[`, 5)
t = tibble(var1,var2,var3,var4,var5) # make tibble
t = t %>% filter(!is.na(var2)) # delete useless lines
t = purrr::map_dfc(t, trimws) # trim whitespace in every column
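If the first five lines line up the way they do for most entries, you can give the columns descriptive names afterwards; the names below are only a guess, since not every paragraph on the page has the same number of lines:

# Hypothetical column names - adjust them after inspecting the rows
t = t %>% rename(name = var1, street = var2, directions = var3, town_hours = var4, fuel = var5)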

Upvotes: 0
