Reputation: 5281
I am trying to scrape image sources from different websites, and I used rvest
to do that. The problem is that I have a node set whose entries contain the
source, but I need to extract the source URL itself from each entry.
Here are the first few entries:
> string
{xml_nodeset (100)}
[1] <td class="no-wrap currency-name" data-sort="Bitcoin">\n<img src="https://s2.coinmarketc ...
[2] <td class="no-wrap currency-name" data-sort="Ethereum">\n<img src="https://s2.coinmarket ...
[3] <td class="no-wrap currency-name" data-sort="Ripple">\n <img src="https://s2.coinmarketc ...
What I need is basically the part that comes after src=", so for the first one
"https://s2.coinmarketcap.com/static/img/coins/16x16/1.png"
(the console doesn't show the full strings, but this is what appears after the
dots ..., and more follows it as well).
Any help is appreciated as I am a bit stuck here.
Upvotes: 1
Views: 251
Reputation: 5281
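As pointed out in the comments, a regular expression should do it:
# convert the node set to character, then strip everything up to and including
# https:// and everything from .png onwards (the dot is escaped to match literally)
myhtml <- gsub('^.*https://|\\.png.*$', "", as.character(string))
myhtml <- paste0("https://", myhtml, ".png")
The first line extracts the part of each string between https:// and .png, and
the second one pastes https:// back onto the front and .png back onto the end,
so that the result is a valid source URL. Note that this assumes each entry
contains exactly one https:// URL and that it ends in .png.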
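As a sketch of a slightly more general variant, the src attribute can be matched directly rather than relying on https:// and .png. The two input strings below are made up to mimic as.character(string) (the alt attribute in particular is an assumption), and html, m and src are illustrative names only:

```r
# Made-up sample input mimicking as.character(string); not real page content.
html <- c(
  '<td class="no-wrap currency-name" data-sort="Bitcoin">\n<img src="https://s2.coinmarketcap.com/static/img/coins/16x16/1.png" alt="Bitcoin"></td>',
  '<td class="no-wrap currency-name" data-sort="Ethereum">\n<img src="https://s2.coinmarketcap.com/static/img/coins/16x16/1027.png" alt="Ethereum"></td>'
)

# Find the first src="..." in each string (regmatches drops entries with no match).
m <- regmatches(html, regexpr('src="[^"]+"', html))

# Strip the leading src=" and the trailing quote, keeping only the URL.
src <- sub('^src="([^"]*)"$', "\\1", m)
src
```

This does not depend on the URL scheme or file extension, but like any regex on HTML it only picks up the first src in each entry.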
Upvotes: 1
Reputation: 79238
You can do:
library(rvest)
read_html("https://coinmarketcap.com/coins/") %>%
  html_nodes("td img") %>%
  html_attr("src")
[1] "https://s2.coinmarketcap.com/static/img/coins/16x16/1.png"
[2] "https://s2.coinmarketcap.com/generated/sparklines/web/7d/usd/1.png"
[3] "https://s2.coinmarketcap.com/static/img/coins/16x16/1027.png"
[4] "https://s2.coinmarketcap.com/generated/sparklines/web/7d/usd/1027.png"
[5] "https://s2.coinmarketcap.com/static/img/coins/16x16/52.png"
[6] "https://s2.coinmarketcap.com/generated/sparklines/web/7d/usd/52.png"
[7] "https://s2.coinmarketcap.com/static/img/coins/16x16/1831.png"
[8] "https://s2.coinmarketcap.com/generated/sparklines/web/7d/usd/1831.png"
...
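The output above mixes coin icons with sparkline images. If only the icons from the question are wanted, a narrower CSS selector might help. The following is a minimal sketch on a made-up HTML fragment; it assumes the icon cells carry the currency-name class shown in the question's output, and page and icons are illustrative names:

```r
library(rvest)

# Made-up fragment mirroring the structure shown in the question.
page <- read_html('<table><tr>
  <td class="no-wrap currency-name" data-sort="Bitcoin"><img src="https://s2.coinmarketcap.com/static/img/coins/16x16/1.png"></td>
  <td><img src="https://s2.coinmarketcap.com/generated/sparklines/web/7d/usd/1.png"></td>
</tr></table>')

# Keep only images inside cells with the currency-name class.
icons <- page %>%
  html_nodes("td.currency-name img") %>%
  html_attr("src")
icons
```

On the live page the same selector would replace "td img" in the pipeline above, provided the markup still matches what the question shows.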
Upvotes: 2