Reputation: 27
I want to automatically download all the whitepapers from this website: https://icobench.com/ico. When you open an ICO's page, there is a whitepaper tab to click, which takes you to the pdf preview screen. I want to retrieve the pdf url from the page's HTML using rvest, but nothing comes back no matter which node selectors I try.
An example of what inspecting one ICO's page shows:
embed id="plugin" type="application/x-google-chrome-pdf"
src="https://www.ideafex.com/docs/IdeaFeX_twp_v1.1.pdf"
stream-url="chrome-extension://mhjfbmdgcfjbbpaeojofohoefgiehjai/9ca6571a-509f-4924-83ef-5ac83e431a37"
headers="content-length: 2629762
content-type: application/pdf
I've tried something like the following:
library(rvest)
library(stringr)  # provides str_c()
url <- "https://icobench.com/ico"
url <- str_c(url, '/hygh')
webpage <- read_html(url)
Item_html <- html_nodes(webpage, "content embed#plugin")
Item <- html_attr(Item_html, "src")
or
Item <- html_text(Item_html)
Item
But nothing comes back. Can anybody help?
From the example above, I'm expecting to retrieve the embedded url pointing to the pdf whitepaper on the ICO's official website, e.g. https://www.ideafex.com/docs/IdeaFeX_twp_v1.1.pdf
But since it's rendered by a Google Chrome plugin, the rvest package doesn't retrieve it. Any ideas?
Upvotes: 1
Views: 143
Reputation: 84465
A possible solution:
Using your example, I would change the selector to combine an id selector with an attribute = value selector, joined by a descendant combinator. This targets the whitepaper tab by id and the link inside it by its href attribute value, using the $= (ends with) operator to match the pdf.
library(rvest)
library(magrittr)
url <- "https://icobench.com/ico/hygh"
pdf_link <- read_html(url) %>% html_node(., "#whitepaper [href$=pdf]") %>% html_attr(., "href")
Faster option?
You could also target the object tag and its data attribute:
pdf_link <- read_html(url) %>% html_node(., "#whitepaper object") %>% html_attr(., "data")
Check which selector is fit for purpose across pages.
The latter is likely faster and seems to be used across the few pages I checked.
Solution for all icos:
You could put this in a function that takes a url as input (the url of each ico) and returns the pdf url, or some other specified value if no url is found or the css selector fails to match. You'd need to add some handling for that scenario. Then call that function in a loop over all ico urls.
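A minimal sketch of such a function, with assumptions stated: the name get_pdf_link is hypothetical, NA is chosen here as the "no match" value, and the function tries the href selector first, then falls back to the object tag. It is checked against a small made-up HTML snippet (example.com url) rather than the live site, relying on read_html accepting a literal HTML string as well as a url.

```r
library(rvest)

# Hypothetical helper: takes an ICO page (a url or a literal HTML string)
# and returns the whitepaper pdf url, or NA_character_ if the page cannot
# be read or neither selector matches.
get_pdf_link <- function(page) {
  tryCatch({
    doc <- read_html(page)
    # First try the link whose href ends with "pdf" inside #whitepaper.
    link <- html_attr(html_node(doc, "#whitepaper [href$=pdf]"), "href")
    # Fall back to the data attribute of the object tag.
    if (is.na(link)) {
      link <- html_attr(html_node(doc, "#whitepaper object"), "data")
    }
    link
  }, error = function(e) NA_character_)
}

# Made-up snippet standing in for an ICO page (not taken from icobench).
sample_page <- '<div id="whitepaper"><a href="https://example.com/paper.pdf">Whitepaper</a></div>'
get_pdf_link(sample_page)
```

With a character vector of ico urls (say ico_urls, also a hypothetical name), something like vapply(ico_urls, get_pdf_link, character(1)) would give one pdf url or NA per page.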
Upvotes: 0