Juraj

Reputation: 13

How to pass vector elements as individual arguments to a function in R

I am working on a web scraping project using rvest.

html_text(html_nodes(url, CSS)) 

extracts text from url wherever the CSS selector matches. My problem is that the website I am scraping uses a unique CSS ID for each listed product (such as ListItem_001_Price), so one selector matches exactly one item's price and a single call cannot scrape all the prices.

I can create a vector

V <- c("ListItem_001_Price", "ListItem_002_Price", "ListItem_003_Price")

for all the products' CSS IDs manually. Is it possible to pass its individual elements to the html_nodes() function in one go and collect the resulting data as a single vector or data frame?

How can I make this work?

Upvotes: 1

Views: 1556

Answers (2)

nJGL

Reputation: 864

html_nodes() needs a leading "." to find your tags by CSS class (a leading "#" would select by id instead). You could manually create

V <- c(".ListItem_001_Price", ".ListItem_002_Price", ".ListItem_003_Price")

as you suggest, but I recommend that you use a regex such as 'ListItem_([0-9]{3})_Price' to match the classes, so you can avoid the manual labour. Make sure you run the regex on the actual string of your markup, and not on the parsed html-node object (see below).

In R, apply(), lapply(), sapply() and the like work much like a short loop: they apply a function to every element of an object that holds multiple values, such as a list, data frame, matrix or vector.

In your case it's a vector, and a good way to begin to understand how it works is to think of it like this:

sapply(vector, function(x) THING-TO-DO-WITH-ITEM-IN-VECTOR)
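As a minimal sketch of that pattern in base R only (the ids vector here is made up for illustration and is not taken from the question), sapply() calls the function once per element and collects the results into one vector:

```r
# Hypothetical three-digit ids, used only for illustration
ids <- c("001", "002", "003")

# The anonymous function is called once per element of ids;
# x holds the current element on each call
selectors <- sapply(ids, function(x) paste0(".ListItem_", x, "_Price"))

# For a character input, sapply names each result after its input element
print(selectors)

# unname() drops those names and leaves a plain character vector
print(unname(selectors))
```

The same shape carries over to scraping: replace the paste0() call with whatever should happen to each selector.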

In your case, the thing to do with each item in the vector is to fetch the html_text of the node that the item selects. See the code below for an example:

library(rvest)
# An example piece of html
example_markup <- "<ul>
<li class=\"ListItem_041_Price\">Brush</li>
<li class=\"ListItem_031_Price\">Phone</li>
<li class=\"ListItem_002_Price\">Paper clip</li>
<li class=\"ListItem_012_Price\">Bucket</li>
</ul>"
html <- read_html(example_markup)


# Avoid manual creation of css with regex
regex <- 'ListItem_([0-9]{3})_Price'
# Note that ([0-9]{3}) will match three consecutive numeric characters
price_classes <- regmatches(example_markup, gregexpr(regex, example_markup))[[1]]
# Paste leading "." so that html_nodes() can find the class:
price_classes <- paste(".", price_classes, sep="")

# A single entry is found like so:
html %>% html_nodes(".ListItem_031_Price") %>% html_text()

# Use sapply to get a named character vector of your products
# Note how ".ListItem_031_Price" from the line above is replaced by x
# which will be each item of price_classes in turn.
products <- sapply(price_classes, function(x) html %>% html_nodes(x) %>% html_text())

The result in products is a named character vector, provided each class matches exactly one node as it does here. Use unname(products) to drop the names.
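Since the question also asks for a data frame, here is one possible sketch of that last step. It assumes each selector matches at most one node; the names markup and result are placeholders of mine, and vapply() is swapped in for sapply() so that every result is guaranteed to be a single string.

```r
library(rvest)

# A small stand-in for the real page
markup <- "<ul>
<li class=\"ListItem_041_Price\">Brush</li>
<li class=\"ListItem_002_Price\">Paper clip</li>
</ul>"
html <- read_html(markup)

price_classes <- c(".ListItem_041_Price", ".ListItem_002_Price")

# Take the first match per selector; [1] yields NA when nothing matches,
# so every call returns exactly one string as vapply requires
products <- vapply(
  price_classes,
  function(x) html_text(html_nodes(html, x))[1],
  character(1)
)

# One row per selector, with the selector and its text side by side
result <- data.frame(
  selector = names(products),
  text = unname(products),
  stringsAsFactors = FALSE
)
print(result)
```

If a selector can match several nodes, a list (as returned by lapply) is probably the safer container.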

Upvotes: 0

Tim Biegeleisen

Reputation: 521249

You can try using lapply here:

V <- c("ListItem_001_Price", "ListItem_002_Price", "ListItem_003_Price")
results <- lapply(V, function(x) html_text(html_nodes(url, x)))

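I assume here that url holds the parsed page and that your nested call to html_text will, in general, return a character vector of the text of the matching nodes for each item in V. This leaves you with a list of character vectors, one per selector, which you can then access. Note that a selector for an id needs a leading "#" (and one for a class a leading "."), so prefix the entries of V accordingly.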
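To make the lapply approach concrete, here is a hedged, self-contained sketch. The markup and the name page are invented for illustration, and it assumes the items really are identified by id (as the question states), so each entry of V gets a "#" prefix. The last lines show an alternative that skips the loop with a CSS attribute selector, which may or may not suit the real page.

```r
library(rvest)

# Invented markup standing in for the real page
markup <- "<ul>
<li id=\"ListItem_001_Price\">10</li>
<li id=\"ListItem_002_Price\">20</li>
<li id=\"ListItem_003_Price\">30</li>
</ul>"
page <- read_html(markup)

V <- c("ListItem_001_Price", "ListItem_002_Price", "ListItem_003_Price")

# "#" turns each id into an id selector; lapply returns one list entry per id
results <- lapply(paste0("#", V), function(x) html_text(html_nodes(page, x)))

# Flatten the list into a single character vector
prices <- unlist(results)
print(prices)

# Alternative without a loop: select every node whose id starts with
# "ListItem_" and ends with "_Price"
all_prices <- html_text(html_nodes(page, "[id^='ListItem_'][id$='_Price']"))
print(all_prices)
```

The attribute selector avoids building V by hand, at the cost of also matching any other element whose id follows the same pattern.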

Upvotes: 1
