O.HENRY

Reputation: 27

Extract HTML data with different options in R

There's a really good example of this problem.

https://www.nest.co.uk/product/anglepoise-original-1227-giant-outdoor-floor-lamp

This product has two options: a 'Gloss Finish' type and a 'Satin Finish' type.

Each type comes in its own set of colors.

  • Gloss Finish - Dove Grey / Jet Black / Linen White / Alpine White / Citrus Yellow ...
  • Satin Finish - Duck Egg Blue / Blossom Pink / Linen White

The HTML looks like this:

<select title="Select an option" name="product-option"
        id="product-option-selector" class="product-option-selector">
    <option value="198841">Gloss Finish</option>
    <option value="198857">Satin Finish</option>
</select>

After an option is selected, the page shows detailed data about the colors.

<div itemprop="offers" itemscope itemtype="http://schema.org/Offer">
    <meta itemprop="priceCurrency" content="GBP" />
    <meta itemprop="name" content="Anglepoise Original 1227 Giant Outdoor Floor Lamp: Dove Grey" />
    <meta itemprop="price" content="3,125.00" />
</div><!--meta-->

Unlike this product, the others have only one option and one product name, so the number of rows was never a problem. I have been extracting the option data like this:

  1. Extracting Options
page %>%  # assumes `page` holds the parsed document, e.g. read_html(url)
    html_nodes(xpath = '//*[@id="product-option-selector"]/option') %>%
    html_text()
  2. Extracting Names
page %>%
    html_nodes(".product-options") %>%
    html_nodes("[itemprop=name]") %>%
    html_attr("content")

But for this product the two results have different numbers of rows, so the combined output looks like this:

  [Options]        [Colors]

1  Gloss Finish    Dove Grey
2  Satin Finish    Jet Black
3  NA              Linen White
4  NA              Alpine White
5  NA              Citrus Yellow

But I want a result like this:

  [Options]        [Colors]

1  Gloss Finish    Dove Grey
2  Gloss Finish    Jet Black
3  Gloss Finish    Linen White
4  Gloss Finish    Alpine White
5  Gloss Finish    Citrus Yellow
6  Satin Finish    Duck Egg Blue
7  Satin Finish    Blossom Pink

I have tried everything I can think of and desperately need your help. Does anyone have a creative idea for this?

Upvotes: 0

Views: 37

Answers (1)

GGamba

Reputation: 13680

We can do:

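library(rvest)
library(purrr)

url <- 'https://www.nest.co.uk/product/anglepoise-original-1227-giant-outdoor-floor-lamp'

# Download and parse the product page
page <- read_html(url)

# One row per <option>: its id (the value attribute) and its label
prods <- page %>%
    html_nodes('#product-option-selector') %>%
    html_nodes('option') %>%
    map_df(~{
        id <- .x %>% html_attr('value')
        name <- .x %>% html_text()
        data.frame(id, name, stringsAsFactors = FALSE)
    })

# One row per color: the rel attribute of each .option-group is read as the
# option id, and the text of its <li> children as the color names
colors <- page %>%
    html_nodes('.option-group') %>%
    map_df(~{
        id <- .x %>% html_attr('rel')
        colors <- .x %>%
            html_nodes('li') %>%
            html_text()
        data.frame(id, colors = colors, stringsAsFactors = FALSE)
    })

# Join the two tables on their shared id column
final <- merge(prods, colors, by = 'id')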

# final
#        id         name           colors
# 1  198841 Gloss Finish       Dove Grey
# 2  198841 Gloss Finish       Jet Black
# 3  198841 Gloss Finish     Linen White
# 4  198841 Gloss Finish    Alpine White
# 5  198841 Gloss Finish   Citrus Yellow
# 6  198841 Gloss Finish     Crimson Red
# 7  198841 Gloss Finish    Fresh Orange
# 8  198841 Gloss Finish Vibrant Magenta
# 9  198841 Gloss Finish     Marine Blue
# 10 198857 Satin Finish   Duck Egg Blue
# 11 198857 Satin Finish    Blossom Pink
# 12 198857 Satin Finish     Linen White
# 13 198857 Satin Finish       Moon Grey
# 14 198857 Satin Finish      Slate Grey
# 15 198857 Satin Finish       Jet Black

It basically builds two tables, just like you did, and then merges them on the ID.
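The merge step can be tried on its own, without touching the website. This is a minimal sketch that uses two small hand-written data frames as stand-ins for the scraped tables (the values are copied from the output above and the tables are deliberately shortened):

```r
# Toy stand-ins for the two scraped tables (shortened, not the full data)
prods <- data.frame(
    id   = c("198841", "198857"),
    name = c("Gloss Finish", "Satin Finish"),
    stringsAsFactors = FALSE
)
colors <- data.frame(
    id     = c("198841", "198841", "198857"),
    colors = c("Dove Grey", "Jet Black", "Duck Egg Blue"),
    stringsAsFactors = FALSE
)

# merge() joins on the shared column "id", so each finish name is
# repeated once for every color row carrying the same id
final <- merge(prods, colors, by = "id")
print(final)
```

Because the join is driven by the id column, it does not matter that the two tables have different numbers of rows.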

Note that I chose slightly different sources, especially for the color names. Taking them from the li elements was simpler, as no string sanitization is needed there.

How the two tables are built via map_df is quite complicated to explain, but if needed you can expand on what I gave you. (Hint: it basically runs the function inside ~{ } once for each node, and then combines the results into a single data.frame.)
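To see the hint in action, here is a small offline sketch. The HTML fragment is made up for illustration and only mimics the `.option-group` / `rel` / `li` structure that the code above relies on; the real page's markup may differ. Note that `map_df` also needs the dplyr package to be installed.

```r
library(rvest)
library(purrr)

# Hypothetical markup, invented for this example (not copied from the site)
html <- '
<div class="option-group" rel="198841"><ul><li>Dove Grey</li><li>Jet Black</li></ul></div>
<div class="option-group" rel="198857"><ul><li>Duck Egg Blue</li></ul></div>'

groups <- read_html(html) %>% html_nodes(".option-group")

# The ~{ } block runs once per node, with .x bound to that node.
# Each run returns a small data.frame; map_df row-binds them all.
colors <- map_df(groups, ~{
    id <- html_attr(.x, "rel")                 # one id per group
    colors <- html_text(html_nodes(.x, "li"))  # one or more color names
    data.frame(id, colors, stringsAsFactors = FALSE)  # id is recycled
})

print(colors)
```

Inside each call, the single `id` is recycled across all colors of that group, which is what produces one row per option and color pair.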

Hope this helps

Upvotes: 1
