rvest - getting style attribute of any html element

Question

I have a data frame in which one column contains HTML. I am trying to work out, for each row, whether the text is bold or italic, what font size it uses, and so on, as I detailed here

Here is some example html:


ITEM 2. PROPERTIES


I am now trying to approach this by taking, if present, the style attribute from each row. How do I do this? The closest I've got so far is

html %>% html_nodes(xpath="//div[contains(@style)]")

but it doesn't work, for two reasons. First, I don't want to restrict the match to div: I want the style from every row, which may or may not come from a div. Second, with "contains" I can't find a way to say that I want any element where style is present, rather than where it equals a specific value. If possible, I'd like the style attribute from each HTML string, and then to parse that into things like font size and margin-top. Thanks

QHarr · Accepted Answer

I don't want to restrict to div only - I want the style from every row, which may or may not come from div. Also, with "contains", I can't find how to say I want anything where style is present, rather than being equal to a specific value

Use a CSS attribute selector for the style attribute, without specifying an attribute value, on the node or document node. It matches any element that carries a style attribute, whatever the tag.

e.g.

all_nodes_with_style <- html %>% html_nodes('[style]')
first_node_with_style <- html %>% html_node('[style]')
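
Since the question started from XPath, here is a minimal sketch of the equivalent XPath expression: `//*[@style]` tests for the presence of the attribute on any element, so `contains()` is not needed. The HTML snippet and the variable names `via_xpath` and `via_css` are made up for illustration.

```r
library(rvest)

# Hypothetical snippet: two elements carry a style attribute, one does not
html <- read_html(
  '<div style="font-weight: bold;">A</div><p>B</p><span style="font-size: 10pt;">C</span>'
)

# XPath: any element (*) that has a style attribute, whatever its value
via_xpath <- html %>% html_nodes(xpath = "//*[@style]") %>% html_attr("style")

# CSS attribute selector from this answer, for comparison
via_css <- html %>% html_nodes("[style]") %>% html_attr("style")

# Both approaches should select the same nodes in document order
identical(via_xpath, via_css)
```

Either form works with html_nodes; the CSS version is just shorter.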

NA is returned where no style attribute is present when you access the actual value with

html_attr('style')

E.g.

library(rvest)
#> Loading required package: xml2
#> Warning: package 'xml2' was built under R version 4.0.3
library(purrr)
#> 
#> Attaching package: 'purrr'
#> The following object is masked from 'package:rvest':
#> 
#>     pluck

cases <- c('<span style="display: none;">Me</span>', '<div>Nope</div>')

map(cases, ~ read_html(.) %>% html_node('[style]') %>% html_attr('style'))
#> [[1]]
#> [1] "display: none;"
#> 
#> [[2]]
#> [1] NA

Created on 2021-02-10 by the reprex package (v0.3.0)
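
For the second part of the question, parsing the style string into things like font size and margin-top, one possible approach is to split the string on `;` and then on the first `:`. This is only a sketch in base R: `parse_style` is a hypothetical helper, not an rvest function, and the example style string is made up. It assumes simple inline styles and would mis-split values that themselves contain a semicolon.

```r
# Hypothetical helper (not part of rvest): turn one style string into a
# named character vector of property -> value
parse_style <- function(style) {
  # html_attr() returns NA when the attribute is missing
  if (is.na(style)) return(character(0))
  decls <- trimws(strsplit(style, ";", fixed = TRUE)[[1]])
  # Drop empty or malformed pieces that have no "property: value" shape
  decls <- decls[grepl(":", decls, fixed = TRUE)]
  # Split each declaration at its first colon
  props  <- tolower(trimws(sub(":.*$", "", decls)))
  values <- trimws(sub("^[^:]*:", "", decls))
  setNames(values, props)
}

# Made-up example style string
style  <- "font-weight: bold; font-size: 10pt; margin-top: 6pt"
parsed <- parse_style(style)

parsed[["font-size"]]
parsed[["margin-top"]]
```

Applied to a whole column of style strings, something like `lapply(styles, parse_style)` would give one named vector per row, with an empty vector where no style was found.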
