Reputation: 185
I am converting HTML into a tidy CSV. The file is full of tables and has few classes. There are three types of tables, and their outer structure is the same. The only difference is the content of the "th" elements in the nested table, which comes after the element I am interested in. How can I get only the content of the tables that have certain text in "th" ("text_that_I_want_to_get")? Is there a way to insert a class with R into each type of table?
Type 1 of table
<tr>
<th class="array">text_that_I_want_to_get</th>
<td class="array">
<table>
<thead>
<tr>
<th class="string">name</th>
<th class="string">mean</th>
<th class="string">stdev</th>
</tr>
</thead>
<tbody>
Type 2 of table
<tr>
<th class="array">text_that_I_want_to_get</th>
<td class="array">
<table>
<thead>
<tr>
<th class="string">name</th>
<th class="array">answers</th>
</tr>
</thead>
<tbody>
Type 3 of table
<tr>
<th class="array">text_that_I_want_to_get</th>
<td class="array">
<table>
<thead>
<tr>
<th class="string">Reference</th>
</tr>
</thead>
<tbody>
Upvotes: 0
Views: 50
Reputation: 174278
You need the following three xpaths:
xpath1 <- "//td[table[./thead/tr/th/text() = 'stdev']]/preceding-sibling::th"
xpath2 <- "//td[table[./thead/tr/th/text() = 'answers']]/preceding-sibling::th"
xpath3 <- "//td[table[./thead/tr/th/text() = 'Reference']]/preceding-sibling::th"
These find the td node that contains each of the three table types, then locate its preceding th sibling, which holds the text you want.
So to get "text_that_I_want_to_get" for table type 1, you do:
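library(rvest)  # provides read_html(), html_nodes(), html_text() and %>%
# url is the path or URL of your HTML file
read_html(url) %>% html_nodes(xpath = xpath1) %>% html_text()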
#> [1] "text_that_I_want_to_get"
And you can do the same with xpath2 and xpath3 to get the text from table type 2 and table type 3.
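To try this without the original file, here is a minimal self-contained sketch. The inline HTML, the labels (label_one, label_two, label_three) and the class names (type1, type2, type3) are made up for illustration and are not from the question. The last step is one possible way to tag each nested table with a class, using xml_set_attr() from xml2, which may help with the second part of the question.

```r
library(rvest)  # read_html(), html_nodes(), html_text(), html_attr(), %>%
library(xml2)   # xml_set_attr()

# Hypothetical document: one outer row per table type, mirroring the question.
html <- '<table>
<tr><th class="array">label_one</th><td class="array"><table>
  <thead><tr><th class="string">name</th><th class="string">mean</th><th class="string">stdev</th></tr></thead>
  <tbody><tr><td>a</td><td>1</td><td>2</td></tr></tbody></table></td></tr>
<tr><th class="array">label_two</th><td class="array"><table>
  <thead><tr><th class="string">name</th><th class="array">answers</th></tr></thead>
  <tbody><tr><td>b</td><td>yes</td></tr></tbody></table></td></tr>
<tr><th class="array">label_three</th><td class="array"><table>
  <thead><tr><th class="string">Reference</th></tr></thead>
  <tbody><tr><td>c</td></tr></tbody></table></td></tr>
</table>'

doc <- read_html(html)

# The header text that distinguishes each table type.
markers <- c(type1 = "stdev", type2 = "answers", type3 = "Reference")

# Build the same XPath as above for each marker and pull out the th label.
labels <- vapply(markers, function(m) {
  xp <- sprintf("//td[table[./thead/tr/th/text() = '%s']]/preceding-sibling::th", m)
  doc %>% html_nodes(xpath = xp) %>% html_text()
}, character(1))

# Tag each nested table with a class named after its type (modifies doc in place).
for (type in names(markers)) {
  xp <- sprintf("//table[thead/tr/th/text() = '%s']", markers[[type]])
  xml_set_attr(html_nodes(doc, xpath = xp), "class", type)
}

# Classes of the three nested tables, in document order.
classes <- doc %>% html_nodes(xpath = "//td/table") %>% html_attr("class")
```

Once the classes are set, a CSS selector such as html_nodes(doc, "table.type1") should select only the tables of that type. Note that vapply() with character(1) assumes exactly one match per type, which holds for this toy document but probably not for a real file with many tables; use lapply() there.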
Upvotes: 1