ks123321

Reputation: 61

rvest - getting style attribute of any html element

I have a data frame in which one column contains HTML. I am trying to deduce whether each row is bold or italic, what font size it uses, and so on, as I detailed here

Here is some example html:

"<div align=\"justify\" style=\"font-size: 10pt; margin-top: 10pt\">\n<b>ITEM 2. PROPERTIES</b>\n</div>"

I am now trying to approach this by taking, if present, the style attribute from each row. How do I do this? The closest I've got so far is

html %>% html_nodes(xpath="//div[contains(@style)]")

but it doesn't work, because I don't want to restrict to div only - I want the style from every row, which may or may not come from div. Also, with "contains", I can't find how to say I want anything where style is present, rather than being equal to a specific value. If possible, I'd like the style attribute from each HTML string, and then to parse that into things like font size and margin-top. Thanks

Upvotes: 0

Views: 742

Answers (1)

QHarr

Reputation: 84465

I don't want to restrict to div only - I want the style from every row, which may or may not come from div. Also, with "contains", I can't find how to say I want anything where style is present, rather than being equal to a specific value

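Just use a CSS attribute selector for the style attribute, without specifying an attribute value, on the node or document node. The selector [style] matches any element that has a style attribute, whatever its tag or value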

e.g.

all_nodes_with_style <- html %>% html_nodes('[style]')
first_node_with_style <- html %>% html_node('[style]')
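Since the question started from XPath, here is a minimal sketch of the equivalent XPath expression, assuming the example HTML from the question is read with read_html. The expression //*[@style] selects any element that carries a style attribute, with no tag restriction and no contains() test; the variable names are only illustrative.

```r
library(rvest)

# The example HTML from the question, written as a single-quoted R string.
html <- read_html('<div align="justify" style="font-size: 10pt; margin-top: 10pt">\n<b>ITEM 2. PROPERTIES</b>\n</div>')

# XPath: any element (*) that has a style attribute, whatever its value.
styled <- html %>% html_nodes(xpath = "//*[@style]")

# Pull the raw style string from each matched node.
style_values <- styled %>% html_attr("style")
style_values
#> [1] "font-size: 10pt; margin-top: 10pt"
```

Either form should select the same nodes here; the CSS selector is simply shorter.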

With html_node, NA is returned where no node with a style attribute is present when you access the actual value with

html_attr('style')

E.g.

library(rvest)
#> Loading required package: xml2
#> Warning: package 'xml2' was built under R version 4.0.3
library(purrr)
#> 
#> Attaching package: 'purrr'
#> The following object is masked from 'package:rvest':
#> 
#>     pluck

cases <- c('<a href="#" id="style" style="display: none;">Me</a>', '<a href="#" id="no_style" >Nope</a>')

map(cases , ~ read_html(.) %>% html_node('[style]') %>% html_attr('style'))
#> [[1]]
#> [1] "display: none;"
#> 
#> [[2]]
#> [1] NA

Created on 2021-02-10 by the reprex package (v0.3.0)
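For the last part of the question, parsing the style string into things like font size and margin-top, one possible approach is sketched below in base R. parse_style is a hypothetical helper, not part of rvest, and it assumes simple declarations of the form "property: value" separated by semicolons; it would not cope with semicolons inside quoted values or url(...).

```r
# Hypothetical helper: turn a style string such as
# "font-size: 10pt; margin-top: 10pt" into a named character vector.
parse_style <- function(style) {
  # A missing style attribute (NA from html_attr) gives an empty result.
  if (is.na(style)) return(character(0))
  # Declarations are separated by semicolons; drop empty pieces.
  decls <- trimws(strsplit(style, ";", fixed = TRUE)[[1]])
  decls <- decls[nzchar(decls)]
  # Split each declaration at its first colon only.
  props  <- trimws(sub(":.*$", "", decls))
  values <- trimws(sub("^[^:]*:", "", decls))
  setNames(values, props)
}

parsed <- parse_style("font-size: 10pt; margin-top: 10pt")
parsed[["font-size"]]
#> [1] "10pt"
parsed[["margin-top"]]
#> [1] "10pt"
```

Applied to the output of html_attr('style') for each row, this gives one named vector per row, with an empty vector where no style was found.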

Upvotes: 1
