M_D

Reputation: 297

Scrape website links in R

When scraping links in R with either rvest or RSelenium, you can select them by defining the beginning of the HTML tag, e.g. a href within a given node. But what if I face the following two links:

<a href="some_link" data-tracking="click_body" data-tracking-data='{"touch_point_button":"photo"}' data-featured-tracking="listing_no_promo">

<a href="some_link" data-tracking="click_body" data-tracking-data='{"touch_point_button":"photo"}' data-featured-tracking="listing_promo">

As you can see, they differ only in the very last part. How can I grab (select) only the links with promo / no promo?
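To illustrate the kind of selection I mean, here is a minimal rvest sketch of the usual approach (the URL is just the listing page mentioned below, used as an example):

```r
library(rvest)

# Example page, just to illustrate the usual approach
page <- read_html("https://www.otodom.pl/sprzedaz/mieszkanie/")

# Grab every link by defining the node and tag up front
all_links <- page %>%
  html_nodes("a") %>%
  html_attr("href")
```

This gets all hrefs, but it cannot distinguish the promo links from the no_promo ones.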

Upvotes: 0

Views: 221

Answers (2)

Use XPath and the XML library. Assuming that you are looking for the no_promo links:

library(XML)
library(httr)
response <- GET(yourLink)
parsedoc <- htmlParse(content(response, as = "text"))
xpathSApply(parsedoc, "//a[@data-featured-tracking='listing_no_promo']", xmlGetAttr, "href")

or, if you are just looking for the links whose data-featured-tracking attribute contains the keyword "no_promo", the last line becomes:

xpathSApply(parsedoc, "//a[contains(@data-featured-tracking, 'no_promo')]", xmlGetAttr, "href")
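The same XPath also works through rvest, if you prefer that stack; a sketch, with yourLink standing in for the real URL as above:

```r
library(rvest)

page <- read_html(yourLink)  # yourLink is a placeholder, as above
no_promo <- html_attr(
  html_nodes(page, xpath = "//a[contains(@data-featured-tracking, 'no_promo')]"),
  "href"
)
```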

Upvotes: 1

niko

Reputation: 5281

So let's define links to be your object containing the html strings, e.g.

 links <- html_children(read_html("https://www.otodom.pl/sprzedaz/mieszkanie/"))

Then you can use regular expressions to match "promo" / "no_promo" within those strings:

p1 <- grepl("promo", links, fixed = TRUE)
p1
[1] TRUE TRUE
p2 <- grepl("no_promo", links, fixed = TRUE)
p2
[1] FALSE  TRUE

So links[p1] contains all strings containing "promo" (which includes the "no_promo" ones as well, since "no_promo" contains "promo"), and links[p2] contains all strings containing "no_promo". Now all that remains is to subset:

promo <- links[p1 & !p2] # strings with "promo" but not "no_promo"
no.promo <- links[p2]    # strings with "no_promo"
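If you want to sidestep the overlap between the two patterns entirely, you can match the full attribute value instead; a sketch with inline example strings mimicking the question's HTML:

```r
# Example strings modelled on the two <a> tags from the question
links <- c(
  '<a href="link1" data-featured-tracking="listing_promo">',
  '<a href="link2" data-featured-tracking="listing_no_promo">'
)

# Including the surrounding quote makes each match exact:
# "listing_promo" is not a substring of "listing_no_promo"
promo    <- links[grepl('"listing_promo"', links, fixed = TRUE)]
no.promo <- links[grepl('"listing_no_promo"', links, fixed = TRUE)]
```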

Upvotes: 1
