Lambo

Reputation: 897

How to extract data from HTML into R

I have a link that contains a table. The first thing I tried was to look for a download button to click, but unfortunately there isn't one. Then I tried to use the XML package in R to fetch the data from the relevant nodes and build up a data frame myself.

In order to do this I need to know which node (or HTML tag) I want to extract. So I right-clicked in the web browser and found the tag that contains the table I want.

The content of the table starts at <fieldset id="result">. We can also see in the browser that the first row of the table is <li class="vesselResultEntry removeBackground">.

Then, when I tried to use R to download this HTML, I found that all the <li> tags relating to the table are gone and have been replaced by <li class="toRemove"/>. Here is my R code:

library(XML)

# Download the raw HTML and parse it
url <- "http://www.fao.org/figis/vrmf/finder/search/#stats"
webpage <- readLines(url)
htmlpage <- htmlParse(webpage, asText = TRUE)

# Pull out the <ul> that should hold the search results
data <- xpathSApply(htmlpage, "//ul[@id='searchResultsContainer']")
data

# <ul id="searchResultsContainer" class="clean resultsContainer"><li class="toRemove"></li></ul> 

What I'm trying to do in the code is simply to see whether I can fetch the content of a specific tag. Clearly the rows I want are not in the object (webpage) I saved.
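For completeness, I'd expect a quick cross-check with the rvest package (just a sketch, not run against the live page) to come back with the same placeholder, so this doesn't look like an XML-package quirk:

library(rvest)

page <- read_html("http://www.fao.org/figis/vrmf/finder/search/#stats")
# Target the same container as the XML code above
html_node(page, "ul#searchResultsContainer")
# Presumably this also shows only <li class="toRemove"></li>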

So my questions are:

Is there a way to download the table I want by any means (ideally in R)?

Is there some kind of protection on this website that prevents me from downloading the whole HTML as a text file and fetching the data?

I'd much appreciate any suggestions.

Upvotes: 0

Views: 1671

Answers (1)

Ouroborus

Reputation: 16865

The page you're trying to fetch is assembled dynamically in the browser on load. The content you get by fetching the URL directly does not contain the data you see when you view the DOM; that data is loaded later from a separate URL.

I took a look and the URL in question is:

http://www.fao.org/figis/vrmf/finder/services/public/vessels/search?c=true&gd=true&nof=false&not=false&nol=false&ps=30&o=0&user=NOT_SET

I'm not sure what most of the query string is, but it's clear that ps is "page size" and o is "offset". Page size seems to cap at 200, above which it is forced back to 30. The URL returns JSON, so you'll need some way to parse that. The data embedded in the responses says there are 231047 entries, so you'll have to make multiple requests to get it all.
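A minimal sketch of how that paging could look in R with jsonlite; the shape of the returned JSON is an assumption here, so inspect one response first to see where the rows and the total count actually live:

library(jsonlite)

base_url <- "http://www.fao.org/figis/vrmf/finder/services/public/vessels/search"

# ps = page size (apparently capped at 200), o = offset into the full result set
fetch_page <- function(offset, page_size = 200) {
  query <- sprintf("c=true&gd=true&nof=false&not=false&nol=false&ps=%d&o=%d&user=NOT_SET",
                   page_size, offset)
  fromJSON(paste0(base_url, "?", query))
}

first_page <- fetch_page(0)
str(first_page, max.level = 1)  # inspect to find where the vessel rows and total count live

At 200 entries per request you'd need well over a thousand calls to cover the whole set, so it's worth adding a Sys.sleep() between requests.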

Data providers usually do not appreciate people scraping their data like that. You might want to look around for a downloadable version.

Upvotes: 2
