How to get HTML element considering later content of another tag and not the class?

Question

I am transforming HTML into a beautiful and tidy CSV. I have a file full of tables and with few classes. I have three types of tables, and their structure is the same. The only difference is the content within the "th" element which comes after the element in which I am interested. How can I get only the content of the tables that have certain text in "th" ("text_that_I_want_to_get")? Is there a way to insert a class with R inside each type of table?

Type 1 of table

    text_that_I_want_to_get
        name
        mean
        stdev

Type 2 of table

    text_that_I_want_to_get
        name
        answers

Type 3 of table

    text_that_I_want_to_get
        Reference

Allan Cameron · Accepted Answer

You need the following three xpaths:

xpath1 <- "//td[table[./thead/tr/th/text() = 'stdev']]/preceding-sibling::th"
xpath2 <- "//td[table[./thead/tr/th/text() = 'answers']]/preceding-sibling::th"
xpath3 <- "//td[table[./thead/tr/th/text() = 'Reference']]/preceding-sibling::th"

Each expression finds the td node that wraps a table of the given type, identified by a header cell that only that type has (stdev, answers or Reference), and then selects that td's preceding th sibling, which holds the text you want.
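To see how the predicate and the preceding-sibling axis work together, the first expression can be split into two steps. This is only a sketch: the HTML fragment and the label `label_for_this_table` are made up for illustration, and assume the label sits in a th next to the td that holds the nested table.

```r
library(rvest)

# Hypothetical fragment: the label sits in a <th>, the nested table in the sibling <td>.
page <- read_html('
<table><tr>
  <th>label_for_this_table</th>
  <td><table>
    <thead><tr><th>name</th><th>mean</th><th>stdev</th></tr></thead>
    <tbody><tr><td>a</td><td>1.5</td><td>0.2</td></tr></tbody>
  </table></td>
</tr></table>')

# Step 1: keep only <td> nodes that directly contain a table with a 'stdev' header cell.
cells <- html_nodes(page, xpath = "//td[table[./thead/tr/th/text() = 'stdev']]")

# Step 2: from each matching <td>, move to the <th> that precedes it in the same row.
label <- cells %>% html_nodes(xpath = "preceding-sibling::th") %>% html_text()
print(label)
```

The data cells inside the nested table are also td nodes, but they contain no table of their own, so the predicate in step 1 filters them out.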

So to get "text_that_I_want_to_get" for table type 1, you do:

library(rvest)  # provides read_html(), html_nodes(), html_text() and the %>% pipe

read_html(url) %>% html_nodes(xpath = xpath1) %>% html_text()
#> [1] "text_that_I_want_to_get"

And you can do the same with xpath2 and xpath3 to get text from table type 2 and table type 3.
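If it helps, the three lookups can be run in one pass, and the same predicate can be reused to add a class to each nested table, which the question also asks about. This is a rough sketch rather than part of the answer above: the HTML, the labels `label_type_1` to `label_type_3` and the class names `type1` to `type3` are invented for the example, and it assumes each table type is identified by the header cell used in the three xpaths.

```r
library(rvest)
library(xml2)

# Hypothetical markup: one outer row per table type.
page <- read_html('
<table>
  <tr>
    <th>label_type_1</th>
    <td><table><thead><tr><th>name</th><th>mean</th><th>stdev</th></tr></thead>
        <tbody><tr><td>a</td><td>1.5</td><td>0.2</td></tr></tbody></table></td>
  </tr>
  <tr>
    <th>label_type_2</th>
    <td><table><thead><tr><th>name</th><th>answers</th></tr></thead>
        <tbody><tr><td>b</td><td>7</td></tr></tbody></table></td>
  </tr>
  <tr>
    <th>label_type_3</th>
    <td><table><thead><tr><th>Reference</th></tr></thead>
        <tbody><tr><td>c</td></tr></tbody></table></td>
  </tr>
</table>')

# Header cell that distinguishes each table type (class names are made up).
header_cells <- c(type1 = "stdev", type2 = "answers", type3 = "Reference")

# Build the answer's xpath for each type and collect the matching <th> text.
labels <- sapply(header_cells, function(cell) {
  xpath <- sprintf("//td[table[./thead/tr/th/text() = '%s']]/preceding-sibling::th", cell)
  page %>% html_nodes(xpath = xpath) %>% html_text()
})
print(labels)

# Tag each nested table with a class so it can be selected by class afterwards.
# xml2 modifies the parsed document in place.
for (type in names(header_cells)) {
  xpath <- sprintf("//table[./thead/tr/th/text() = '%s']", header_cells[[type]])
  xml_set_attr(html_nodes(page, xpath = xpath), "class", type)
}

# The tagged tables can now be picked up with a CSS selector.
tagged <- page %>% html_nodes("table.type2") %>% html_table()
print(tagged[[1]])
```

Once the tables carry a class, each type can be converted with html_table() and written out with write.csv() separately.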
