Reputation: 13
I am working on a web scraping project using rvest.
html_text(html_nodes(url, CSS)) extracts data from url wherever the matching CSS selector is found. My problem is that the website I am scraping uses a unique CSS ID for each listed product (such as ListItem_001_Price). So one CSS ID defines exactly one item's price, and automated web scraping doesn't work.
I can create a vector
V <- c("ListItem_001_Price", "ListItem_002_Price", "ListItem_003_Price")
for all the products' CSS IDs manually. Is it possible to pass its individual elements to the html_nodes() function in one go and collect the resulting data back as a single vector/dataframe?
How can I make this work?
Upvotes: 1
Views: 1556
Reputation: 864
html_nodes() needs a leading "." to find your tags by CSS class (or a leading "#" if they are ids rather than classes). You could manually create
V <- c(".ListItem_001_Price", ".ListItem_002_Price", ".ListItem_003_Price")
like you suggest, but I recommend that you use a regex to match the classes, like 'ListItem_([0-9]{3})_Price', so you can avoid the manual labour. Make sure you run the regex on the actual string of your markup, and not on the html-node object (see below).
In R, apply(), lapply(), sapply() and the like work much like a short loop. With them you can apply a function to every member of a data type that contains multiple values, like lists, data frames, matrices or vectors.
In your case it's a vector, and a way to begin to understand how it works is to think of it like:
sapply(vector, function(x) THING-TO-DO-WITH-ITEM-IN-VECTOR)
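To make that pattern concrete before bringing rvest into it, here is a minimal base R sketch. The vector words and the result name n_chars are made up for illustration; the function applied to each item is just nchar().

```r
# Hypothetical input vector, only for illustration
words <- c("brush", "phone", "bucket")

# Apply a function to each element in turn; x is the current element
n_chars <- sapply(words, function(x) nchar(x))

# Because the input is a character vector, sapply() names each result
# after the element it came from
print(n_chars)
```

The same shape carries over to scraping: only the body of function(x) changes.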
In your case, you'd like the thing to do with item in vector to be the fetching of the html_text corresponding to the items in the vector. See the code below for an example:
library(rvest)
# An example piece of html
example_markup <- "<ul>
<li class=\"ListItem_041_Price\">Brush</li>
<li class=\"ListItem_031_Price\">Phone</li>
<li class=\"ListItem_002_Price\">Paper clip</li>
<li class=\"ListItem_012_Price\">Bucket</li>
</ul>"
html <- read_html(example_markup)
# Avoid manual creation of css with regex
regex <- 'ListItem_([0-9]{3})_Price'
# Note that ([0-9]{3}) will match three consecutive numeric characters
price_classes <- regmatches(example_markup, gregexpr(regex, example_markup))[[1]]
# Paste leading "." so that html_nodes() can find the class:
price_classes <- paste(".", price_classes, sep="")
# A single entry is found like so:
html %>% html_nodes(".ListItem_031_Price") %>% html_text()
# Use sapply to get a named character vector of your products
# Note how ".ListItem_031_Price" from the line above is replaced by x
# which will be each item of price_classes in turn.
products <- sapply(price_classes, function(x) html %>% html_nodes(x) %>% html_text())
The result in products is a named character vector. Use unname(products) to drop the names.
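If the class names really do share a fixed prefix and suffix, a single CSS attribute selector may be enough to fetch everything in one call, without building the vector of classes first. The sketch below is an alternative to the sapply() approach, not a replacement for it. It assumes each element carries exactly one class (the ^= and $= operators test the whole attribute value), and the markup is a made-up stand-in for the real page.

```r
library(rvest)

# Made-up markup in the style of the example above, so this block runs on its own
markup <- "<ul>
<li class=\"ListItem_041_Price\">Brush</li>
<li class=\"ListItem_031_Price\">Phone</li>
<li class=\"Other\">Ignored</li>
</ul>"
html <- read_html(markup)

# Match <li> elements whose class starts with "ListItem_" and ends with "_Price"
selector <- "li[class^='ListItem_'][class$='_Price']"
products <- html %>% html_nodes(selector) %>% html_text()
```

If the page uses ids rather than classes, swapping class for id in the selector should work the same way.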
Upvotes: 0
Reputation: 521249
You can try using lapply here:
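# Prefix each ID with "#" so html_nodes() treats it as an ID selector
V <- c("#ListItem_001_Price", "#ListItem_002_Price", "#ListItem_003_Price")
# url is assumed to be the page already parsed with read_html()
results <- lapply(V, function(x) html_text(html_nodes(url, x)))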
I assume here that your nested call to html_text will in general return a character vector of the text corresponding to the matching nodes, for each item in V. This would leave you with a list of vectors which you can then access.
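Since the question asks for a single vector or data frame, here is one way the list could be collapsed afterwards. This is a sketch under assumptions: the markup, the price values, and the names page, ids, prices and price_table are all invented for illustration, and it assumes every ID matches exactly one node.

```r
library(rvest)

# Invented markup standing in for the real page; the IDs follow the question's pattern
page <- read_html('<ul>
  <li id="ListItem_001_Price">1.50</li>
  <li id="ListItem_002_Price">2.75</li>
  <li id="ListItem_003_Price">0.99</li>
</ul>')

ids <- c("ListItem_001_Price", "ListItem_002_Price", "ListItem_003_Price")

# One text result per ID; pasting "#" in front turns each ID into an ID selector
results <- lapply(ids, function(x) html_text(html_nodes(page, paste0("#", x))))

# Flatten the list into one character vector (relies on one match per ID)
prices <- unlist(results)

# Pair each ID with its scraped text in a data frame
price_table <- data.frame(id = ids, price = prices, stringsAsFactors = FALSE)
```

If some IDs might be missing from the page, unlist() would silently drop them and the data.frame() call would fail on mismatched lengths, so a length check on each result would be worth adding in that case.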
Upvotes: 1