Reputation: 297
When scraping links in R with either rvest or RSelenium, you can do it by defining the beginning part of the HTML code, e.g. an a href within a given node. But what if I face the following two links:
<a href="some_link" data-tracking="click_body" data-tracking-data='{"touch_point_button":"photo"}' data-featured-tracking="listing_no_promo">
<a href="some_link" data-tracking="click_body" data-tracking-data='{"touch_point_button":"photo"}' data-featured-tracking="listing_promo">
As you can see, they differ only in the very last part. Do you know how I can grab (select) only the links with promo / no_promo?
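For reference, this is roughly what I mean by defining the a href within a given node with rvest (a minimal sketch with a placeholder URL; it grabs every link on the page without yet distinguishing promo / no_promo):
library(rvest)
# Minimal sketch with a placeholder URL: grab every a href on the page,
# without yet distinguishing promo / no_promo listings.
page <- read_html("https://www.example.com/listings")
links <- page %>% html_nodes("a") %>% html_attr("href")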
Upvotes: 0
Views: 221
Reputation: 1118
Use XPath and the XML library. Assuming that you are looking for the no_promo links:
library(XML)
library(httr)
response <- GET(yourLink)
parsedoc <- htmlParse(content(response, as = "text"), asText = TRUE)
xpathSApply(parsedoc, "//a[@data-featured-tracking='listing_no_promo']", xmlGetAttr, "href")
Or, if you are just looking for those links that contain the keyword "no_promo" in the data-featured-tracking attribute, the last part would be:
xpathSApply(parsedoc, "//a[contains(@data-featured-tracking, 'no_promo')]", xmlGetAttr, "href")
Upvotes: 1
Reputation: 5281
So let's define links to be your object containing the HTML strings, e.g.
library(rvest)
links <- html_children(read_html("https://www.otodom.pl/sprzedaz/mieszkanie/"))
Then you can use regular expressions to match "promo" / "no_promo" within those strings:
p1 <- grepl("promo", links, fixed = TRUE)
p1
[1] TRUE TRUE
p2 <- grepl("no_promo", links, fixed = TRUE)
p2
[1] FALSE TRUE
So links[p1] contains all strings containing "promo" (so "no_promo" as well), and links[p2] contains all strings containing "no_promo". Now all that remains is to subset:
promo <- links[p1 & !p2] # contains strings with promo but not with no_promo
no.promo <- links[p2] # contains strings with no_promo
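For a quick sanity check of that logic, here is a self-contained toy example using just the two tags from the question (no scraping involved):
# Toy vector built from the two example tags in the question:
links <- c(
  '<a href="some_link" data-featured-tracking="listing_no_promo">',
  '<a href="some_link" data-featured-tracking="listing_promo">'
)
p1 <- grepl("promo", links, fixed = TRUE)     # TRUE  TRUE
p2 <- grepl("no_promo", links, fixed = TRUE)  # TRUE  FALSE
links[p1 & !p2]  # only the listing_promo string
links[p2]        # only the listing_no_promo string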
Upvotes: 1