MAXWILL

Reputation: 27

Web Data Scraping issues-Auto Download

I want to automatically download all the whitepapers from this website: https://icobench.com/ico. Each ICO's page has a whitepaper tab that opens a PDF preview. I want to retrieve the PDF URL from the page's HTML using rvest, but nothing comes back, no matter which node selectors I try.

An example of what inspecting one ICO's page shows:

embed id="plugin" type="application/x-google-chrome-pdf" 
src="https://www.ideafex.com/docs/IdeaFeX_twp_v1.1.pdf" 
stream-url="chrome-extension://mhjfbmdgcfjbbpaeojofohoefgiehjai/9ca6571a-509f-4924-83ef-5ac83e431a37" 
headers="content-length: 2629762
content-type: application/pdf

I've tried something like the following:

library(rvest)
library(stringr)  # provides str_c()

url <- "https://icobench.com/ico"
url <- str_c(url, '/hygh')
webpage <- read_html(url)
Item_html <- html_nodes(webpage, "content embed#plugin")
Item <- html_attr(Item_html, "src")

or

Item <- html_text(Item_html)
Item

But nothing comes back. Can anybody help?

From the above example, I expect to retrieve the embedded URL of the PDF whitepaper on the ICO's official website, e.g. https://www.ideafex.com/docs/IdeaFeX_twp_v1.1.pdf

But since the element comes from a Google Chrome plugin, rvest does not retrieve it. Any ideas?

Upvotes: 1

Views: 143

Answers (1)

QHarr

Reputation: 84465

A possible solution:

Using your example, I would change the selector to combine an id selector with an attribute = value selector via a descendant combinator. This targets the whitepaper tab by id and the link inside it by its href attribute value, using the $= (ends with) operator to match the pdf.

library(rvest)
library(magrittr)

url <- "https://icobench.com/ico/hygh"
pdf_link <- read_html(url) %>% html_node(., "#whitepaper [href$=pdf]") %>% html_attr(., "href")

Faster option?

You could also target the object tag and its data attribute:

pdf_link <- read_html(url) %>% html_node(., "#whitepaper object") %>% html_attr(., "data")

Explore which is fit for purpose across pages.

The latter is likely faster and seems to be used across the few sites I checked.


Solution for all ICOs:

You could put this in a function that takes a URL as input (the URL of each ICO) and returns the PDF URL, or some other specified value if no URL is found or the CSS selector fails to match; you'd need to add handling for that scenario. Then call that function in a loop over all ICO URLs.
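
A minimal sketch of such a function, assuming the two selectors shown earlier hold for the other ICO pages (not verified beyond the example). The names extract_pdf_link, get_pdf_link and get_all_pdf_links are hypothetical, and NA is used as the "nothing found" value:

```r
library(rvest)

# Extract the whitepaper link from an already-parsed page.
# Tries the href selector first, then falls back to the object tag.
# Returns NA_character_ when neither selector matches.
extract_pdf_link <- function(page) {
  link <- html_attr(html_node(page, "#whitepaper [href$=pdf]"), "href")
  if (is.na(link)) {
    link <- html_attr(html_node(page, "#whitepaper object"), "data")
  }
  link
}

# Fetch one ICO page and extract its link.
# A failed request (bad URL, timeout, HTTP error) also yields NA_character_.
get_pdf_link <- function(ico_url) {
  tryCatch(
    extract_pdf_link(read_html(ico_url)),
    error = function(e) NA_character_
  )
}

# Apply get_pdf_link to every URL; the result is a character vector
# named by the input URLs.
get_all_pdf_links <- function(ico_urls) {
  vapply(ico_urls, get_pdf_link, character(1))
}
```

Calling get_all_pdf_links(c("https://icobench.com/ico/hygh")) would return one entry per URL. The non-NA entries could then be passed one at a time to download.file() with mode = "wb" to save the PDFs.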

Upvotes: 0
