polo

Reputation: 185

How to get HTML element considering later content of another tag and not the class?

I am converting HTML into a tidy CSV. The file is full of tables and has very few classes. There are three types of tables, and their outer structure is the same. The only difference is the content of the "th" elements in the nested table that comes after the element I am interested in. How can I get the content ("text_that_I_want_to_get") only for the tables that have certain text in those later "th" elements? Is there a way to insert a class with R inside each type of table?

Type 1 of table

<tr>
    <th class="array">text_that_I_want_to_get</th>
    <td class="array">
        <table>
            <thead>
                <tr>
                    <th class="string">name</th>
                    <th class="string">mean</th>
                    <th class="string">stdev</th>
                </tr>
            </thead>
            <tbody>

Type 2 of table

<tr>
    <th class="array">text_that_I_want_to_get</th>
    <td class="array">
        <table>
            <thead>
                <tr>
                    <th class="string">name</th>
                    <th class="array">answers</th>
                </tr>
            </thead>
            <tbody>

Type 3 of table

<tr>
    <th class="array">text_that_I_want_to_get</th>
    <td class="array">
        <table>
            <thead>
                <tr>
                    <th class="string">Reference</th>
                </tr>
            </thead>
            <tbody>

Upvotes: 0

Views: 50

Answers (1)

Allan Cameron

Reputation: 174278

You need the following three xpaths:

xpath1 <- "//td[table[./thead/tr/th/text() = 'stdev']]/preceding-sibling::th"
xpath2 <- "//td[table[./thead/tr/th/text() = 'answers']]/preceding-sibling::th"
xpath3 <- "//td[table[./thead/tr/th/text() = 'Reference']]/preceding-sibling::th"

Each one finds the td node that contains a table of the given type (identified by a header that only that type has), then selects that td's preceding th sibling, which holds the text you want.
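To see the two steps separately, here is a minimal sketch on a made-up miniature document. The outer labels "first" and "third" and the cell values are assumptions for illustration only; the real file will differ.

```r
library(rvest)

# Hypothetical miniature input: one outer row for table type 1, one for type 3
html <- '
<table>
  <tr>
    <th class="array">first</th>
    <td class="array">
      <table>
        <thead><tr><th class="string">name</th><th class="string">mean</th><th class="string">stdev</th></tr></thead>
        <tbody><tr><td>a</td><td>1</td><td>2</td></tr></tbody>
      </table>
    </td>
  </tr>
  <tr>
    <th class="array">third</th>
    <td class="array">
      <table>
        <thead><tr><th class="string">Reference</th></tr></thead>
        <tbody><tr><td>b</td></tr></tbody>
      </table>
    </td>
  </tr>
</table>'

doc <- read_html(html)

# Step 1: keep only the td whose child table has a header cell reading "stdev"
td_nodes <- html_nodes(doc, xpath = "//td[table[./thead/tr/th/text() = 'stdev']]")

# Step 2: from that td, step back to the th in the same row
label <- html_text(html_nodes(td_nodes, xpath = "preceding-sibling::th"))
label
#> [1] "first"
```

The predicate in step 1 only filters td nodes; the cells inside the nested tbody are skipped because they have no table child.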

So to get "text_that_I_want_to_get" for table type 1, you do:

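library(rvest)  # provides read_html(), html_nodes(), html_text() and %>%
# url is the address or file path of your HTML document
read_html(url) %>% html_nodes(xpath = xpath1) %>% html_text()
#> [1] "text_that_I_want_to_get"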

And you can do the same with xpath2 and xpath3 to get text from table type 2 and table type 3.
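As for inserting a class into each type of table, one possible approach is to locate the nested tables with the same kind of predicate and set the attribute with xml2's `xml_set_attr()`. This is only a sketch: the class names `type1` to `type3`, the helper `make_row()` and the sample labels are hypothetical, and the document is built inline so the example is self-contained.

```r
library(rvest)
library(xml2)

# Hypothetical helper that builds one outer row in the shape shown in the question
make_row <- function(label, headers) {
  ths <- paste0("<th>", headers, "</th>", collapse = "")
  sprintf(
    "<tr><th class='array'>%s</th><td class='array'><table><thead><tr>%s</tr></thead><tbody></tbody></table></td></tr>",
    label, ths
  )
}

html <- paste0(
  "<table>",
  make_row("first",  c("name", "mean", "stdev")),
  make_row("second", c("name", "answers")),
  make_row("third",  "Reference"),
  "</table>"
)
doc <- read_html(html)

# Header text unique to each table type, named by the (made-up) class to assign
markers <- c(type1 = "stdev", type2 = "answers", type3 = "Reference")

for (cls in names(markers)) {
  xp <- sprintf("//table[./thead/tr/th/text() = '%s']", markers[[cls]])
  # xml2 documents are modified in place, so doc itself is updated
  xml_set_attr(xml_find_all(doc, xp), "class", cls)
}

# Once tagged, a table type can be selected by class instead of by header text
classes <- xml_attr(html_nodes(doc, xpath = "//td/table"), "class")
label2 <- doc %>%
  html_nodes(xpath = "//td[table[@class = 'type2']]/preceding-sibling::th") %>%
  html_text()
```

After the loop, `classes` should hold one class per nested table in document order, and `label2` the outer label of the type 2 table. If the modified HTML needs to be saved, `xml2::write_html()` can write `doc` back to a file.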

Upvotes: 1
