Reputation: 61
I have a data frame in which one column contains HTML. I am trying to work out, for each row, whether the text is bold or italic, what font size it uses, etc., as I detailed here
Here is some example html:
"<div align=\"justify\" style=\"font-size: 10pt; margin-top: 10pt\">\n<b>ITEM 2. PROPERTIES</b>\n</div>"
I am now trying to approach this by taking, if present, the style attribute from each row. How do I do this? The closest I've got so far is
html %>% html_nodes(xpath="//div[contains(@style)]")
but it doesn't work, for two reasons: 1) I don't want to restrict to div only - I want the style from every row, which may or may not come from div; 2) with "contains", I can't find how to say I want anything where style is present, rather than being equal to a specific value. If possible, I'd like the style attribute from each HTML string, and then to parse that into things like font size and margin-top. Thanks
Upvotes: 0
Views: 742
Reputation: 84465
I don't want to restrict to div only - I want the style from every row, which may or may not come from div. Also, with "contains", I can't find how to say I want anything where style is present, rather than being equal to a specific value
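Use a CSS attribute selector for the style attribute without specifying a value, applied to the node or the document. The selector [style] matches any element that has a style attribute, whatever its tag name and whatever the attribute's value.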
e.g.
all_nodes_with_style <- html %>% html_nodes('[style]')
first_node_with_style <- html %>% html_node('[style]')
With html_node (singular), NA is returned where no style attribute is present when you access the actual value with
html_attr('style')
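If you would rather stay with XPath, as in the question, the presence test is written [@style] and the wildcard * stands for any element name. The snippet below is only a sketch: the input string is a simplified copy of the HTML from the question, and the variable names are made up for illustration.

```r
library(rvest)

# Sample input adapted from the question (hypothetical, simplified)
doc <- read_html('<div align="justify" style="font-size: 10pt; margin-top: 10pt"><b>ITEM 2. PROPERTIES</b></div>')

# CSS: any element that carries a style attribute
css_styles <- doc %>% html_nodes('[style]') %>% html_attr('style')

# XPath: * matches any element name, [@style] tests that the attribute exists
xpath_styles <- doc %>% html_nodes(xpath = '//*[@style]') %>% html_attr('style')

# Both approaches should select the same nodes
identical(css_styles, xpath_styles)
```

The original attempt failed because XPath's contains() takes two arguments (a string and a substring), so it cannot be used on its own to test for the presence of an attribute.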
E.g.
library(rvest)
#> Loading required package: xml2
#> Warning: package 'xml2' was built under R version 4.0.3
library(purrr)
#>
#> Attaching package: 'purrr'
#> The following object is masked from 'package:rvest':
#>
#> pluck
cases <- c('<a href="#" id="style" style="display: none;">Me</a>', '<a href="#" id="no_style" >Nope</a>')
map(cases , ~ read_html(.) %>% html_node('[style]') %>% html_attr('style'))
#> [[1]]
#> [1] "display: none;"
#>
#> [[2]]
#> [1] NA
Created on 2021-02-10 by the reprex package (v0.3.0)
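For the second part of the question (turning the style string into things like font size and margin-top), one possible approach is to split on ";" and then on ":". This is a rough sketch rather than a full CSS parser: parse_style is a hypothetical helper name, and it assumes values contain no ";" or ":" themselves (so something like url(http://...) would break it).

```r
# Hypothetical helper: turn "prop: value; prop: value" into a named character vector.
# Assumption: no ';' or ':' appears inside a value.
parse_style <- function(style) {
  # Rows without a style attribute come back as NA from html_attr()
  if (is.na(style)) return(character(0))

  # Split into declarations, trim whitespace, drop empty pieces
  decls <- trimws(strsplit(style, ";", fixed = TRUE)[[1]])
  decls <- decls[nzchar(decls)]

  # Split each declaration into property name and value
  parts <- strsplit(decls, ":", fixed = TRUE)
  values <- vapply(parts, function(p) trimws(p[2]), character(1))
  names(values) <- vapply(parts, function(p) trimws(p[1]), character(1))
  values
}

props <- parse_style("font-size: 10pt; margin-top: 10pt")
props[["font-size"]]
props[["margin-top"]]
```

You could apply this to the output of the map() call above, e.g. by mapping parse_style over the list of style strings.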
Upvotes: 1