Reputation: 27
There's a really good example of this problem.
https://www.nest.co.uk/product/anglepoise-original-1227-giant-outdoor-floor-lamp
This is a product with two options: a 'Gloss Finish' type and a 'Satin Finish' type.
Each of the two types comes in its own set of colors.
- Gloss Finish - Dove Gray / Jet Black / Linen White / Alpine White / Citrus Yellow ...
- Satin Finish - Duck Egg Blue / Blossom Pink / Linen White
The HTML looks like this:
<select title="Select an option" name="product-option"
id="product-option-selector" class="product-option-selector">
<option value="198841">Gloss Finish</option>
<option value="198857">Satin Finish</option>
</select>
After an option is selected, there is detailed data about the colors:
<div itemprop="offers" itemscope itemtype="http://schema.org/Offer">
<meta itemprop="priceCurrency" content="GBP" />
<meta itemprop="name" content="Anglepoise Original 1227 Giant Outdoor Floor Lamp: Dove Grey" />
<meta itemprop="price" content="3,125.00" />
</div><!--meta-->
Unlike this product, the others have only one option per product name, so the number of rows was never a problem. I used to extract the option data like this:
- Extracting Options
# page is the parsed document, e.g. page <- read_html(url)
page %>%
  html_nodes(xpath = '//*[@id="product-option-selector"]/option') %>%
  html_text()
- Extracting Names
# page is the parsed document, e.g. page <- read_html(url)
page %>%
  html_nodes(".product-options") %>%
  html_nodes("[itemprop=name]") %>%
  html_attr("content")
But for this product the two results have a different number of rows, so the output looks like this:
[Options] [Colors]
1 Gloss Finish Dove Gray
2 Satin Finish Jet Black
3 NA Linen White
4 NA Alpine White
5 NA Citrus Yellow
But I want a result like this:
[Options] [Colors]
1 Gloss Finish Dove Gray
2 Gloss Finish Jet Black
3 Gloss Finish Linen White
4 Gloss Finish Alpine White
5 Gloss Finish Citrus Yellow
6 Satin Finish Duck Egg Blue
7 Satin Finish Blossom Pink
I have tried everything I can think of and would really appreciate your help. Is there a creative way to do this?
Upvotes: 0
Views: 37
Reputation: 13680
We can do:
library(rvest)
library(purrr)

url <- 'https://www.nest.co.uk/product/anglepoise-original-1227-giant-outdoor-floor-lamp'
page <- read_html(url)

# One row per <option>: its value attribute (the ID) and its label
prods <- page %>%
  html_nodes('#product-option-selector') %>%
  html_nodes('option') %>%
  map_df(~{
    id = .x %>% html_attr('value')
    name = .x %>% html_text()
    data.frame(id, name)
  })

# One row per color: the group's rel attribute (the ID) and each li text
colors <- page %>%
  html_nodes('.option-group') %>%
  map_df(~{
    id = .x %>% html_attr('rel')
    colors = .x %>%
      html_nodes('li') %>%
      html_text()
    data.frame(id, colors = colors)
  })

# Join the two tables on their shared id column
final <- merge(prods, colors)
# final
# id name colors
# 1 198841 Gloss Finish Dove Grey
# 2 198841 Gloss Finish Jet Black
# 3 198841 Gloss Finish Linen White
# 4 198841 Gloss Finish Alpine White
# 5 198841 Gloss Finish Citrus Yellow
# 6 198841 Gloss Finish Crimson Red
# 7 198841 Gloss Finish Fresh Orange
# 8 198841 Gloss Finish Vibrant Magenta
# 9 198841 Gloss Finish Marine Blue
# 10 198857 Satin Finish Duck Egg Blue
# 11 198857 Satin Finish Blossom Pink
# 12 198857 Satin Finish Linen White
# 13 198857 Satin Finish Moon Grey
# 14 198857 Satin Finish Slate Grey
# 15 198857 Satin Finish Jet Black
It basically builds two tables, just like you did, and then merges them on the ID.
Note that I chose slightly different sources, especially for the color names. Taking them from the li elements was simpler, as no string sanitization is needed on those.
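To see the merge step in isolation, here is a minimal sketch that needs no scraping. The two data frames are hand-written stand-ins for the scraped tables, using a few IDs and names from the question; they are not the real scraped output.

```r
# Toy stand-ins for the two scraped tables (values copied by hand).
prods <- data.frame(
  id   = c("198841", "198857"),
  name = c("Gloss Finish", "Satin Finish")
)
colors <- data.frame(
  id     = c("198841", "198841", "198857"),
  colors = c("Dove Grey", "Jet Black", "Duck Egg Blue")
)

# merge() joins on the columns both frames share, here only "id",
# so each option name is repeated once per matching color.
final <- merge(prods, colors, by = "id")
print(final)
```

Spelling out by = "id" is optional here, since id is the only common column, but it keeps the join from changing silently if another shared column is added later.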
How the two tables are built via map_df is quite complicated to explain, but if needed you can expand on what I gave you. (Hint: it basically applies the function inside ~{ } to each node, and then combines the results into a data.frame.)
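As a rough illustration of that hint, the sketch below runs map_df over a plain list instead of HTML nodes. The items list and its value/text fields are made up for the example, standing in for what html_attr and html_text would return.

```r
library(purrr)  # map_df() also needs dplyr installed, as it row-binds with it

# A plain list standing in for the node set returned by html_nodes().
items <- list(
  list(value = "198841", text = "Gloss Finish"),
  list(value = "198857", text = "Satin Finish")
)

# ~{ ... } is shorthand for a one-argument function; .x is the current element.
# Each call returns a one-row data.frame, and map_df() stacks them by row.
prods <- map_df(items, ~{
  data.frame(id = .x$value, name = .x$text)
})
print(prods)
```

In the answer's code the same thing happens, except that .x is an HTML node, and for the colors table each call returns several rows (one per li) that all share one id.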
Hope this helps
Upvotes: 1