Tumbledown

Reputation: 1917

Extracting specific HTML elements using the XML package in R

I'm using R with the XML package to parse data stored in HTML files. Following the advice in another answer (https://stackoverflow.com/a/1849388/1409652), I am cooking with gas on this using the readHTMLTable function.

I have one problem: the HTML table of interest has two header rows. One of them isn't picked up by readHTMLTable, and it holds identifying information about the units the data relates to. The units vary across the different HTML files, so I need to read them rather than hard-code a default.

I figure I need to point a function in the XML package at the HTML file and target the specific row I want. Unfortunately I don't know which function, and my grasp of the terminology around this isn't great. I've put a sample of the HTML below but left out the values as they're sensitive; I can swap in some dummy data and post it if that's helpful. The header row containing "Period", "Volume" and "Tariff" comes through fine, as does all the data in the table body. The header row containing "Unit1", "Unit2", etc. does not come through.

<thead> 
<tr> 
<th class="center" colspan="1" rowspan="1"></th><th class="center" onmouseover="javascript:Tip('Unit1');" onmouseout="javascript:UnTip('');" colspan="4" rowspan="1">Unit1</th><th class="center" onmouseover="javascript:Tip('Unit2');" onmouseout="javascript:UnTip('');" colspan="4" rowspan="1">Unit2</th><th class="center" onmouseover="javascript:Tip('Unit3');" onmouseout="javascript:UnTip('');" colspan="4" rowspan="1">Unit3</th><th class="center" onmouseover="javascript:Tip('Others');" onmouseout="javascript:UnTip('');" colspan="4" rowspan="1">Others</th> 
</tr><tr> 
<th class="left" colspan="1" rowspan="1">Period</th><th class="left" colspan="1" rowspan="1">Volume</th><th class="left" colspan="1" rowspan="1">%</th><th class="left" colspan="1" rowspan="1">Tariff</th><th class="left" colspan="1" rowspan="1">%</th><th class="left" colspan="1" rowspan="1">Volume</th><th class="left" colspan="1" rowspan="1">%</th><th class="left" colspan="1" rowspan="1">Tariff</th><th class="left" colspan="1" rowspan="1">%</th><th class="left" colspan="1" rowspan="1">Volume</th><th class="left" colspan="1" rowspan="1">%</th><th class="left" colspan="1" rowspan="1">Tariff</th><th class="left" colspan="1" rowspan="1">%</th><th class="left" colspan="1" rowspan="1">Volume</th><th class="left" colspan="1" rowspan="1">%</th><th class="left" colspan="1" rowspan="1">Tariff</th><th class="left" colspan="1" rowspan="1">%</th> 
</tr> 
</thead>
<tbody>…all the data…</tbody>

In summary: does anyone have pointers on how to extract the unit names from the HTML above using the XML package in R? I'm happy to use other packages if that's the best way.

Upvotes: 1

Views: 506

Answers (1)

Tumbledown

Reputation: 1917

All I needed was some subject knowledge ;-)

Using XPath syntax, I managed to get a list of the unit names with the following:

xpathSApply(doc, "//th[@class='center']/text()")

I believe I can make this more efficient by replacing the leading // with a more specific path, since // searches the whole document.
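For anyone wanting to try this end to end, here is a minimal sketch rather than a definitive solution. The HTML is dummy data I made up to mirror the two-row header in the question (two units spanning two columns each), and the variable names are only illustrative. It assumes the unit names sit in the first row of thead. Passing xmlValue as the function returns plain character strings instead of node objects, and anchoring the path at thead avoids matching th cells elsewhere in the document.

```r
library(XML)

# Dummy stand-in for the real file: made-up values, same two-row header shape.
html <- '
<table>
<thead>
<tr>
<th class="center" colspan="1" rowspan="1"></th>
<th class="center" colspan="2" rowspan="1">Unit1</th>
<th class="center" colspan="2" rowspan="1">Unit2</th>
</tr>
<tr>
<th class="left">Period</th>
<th class="left">Volume</th><th class="left">Tariff</th>
<th class="left">Volume</th><th class="left">Tariff</th>
</tr>
</thead>
<tbody>
<tr><td>1</td><td>10</td><td>0.5</td><td>20</td><td>0.7</td></tr>
</tbody>
</table>'

doc <- htmlParse(html, asText = TRUE)

# Text nodes of the first header row only; the empty corner cell has no
# text node, so it is skipped automatically.
units <- xpathSApply(doc, "//thead/tr[1]/th[@class='center']/text()", xmlValue)

# colspan of each header cell that has text, read as an integer.
spans <- as.integer(
  xpathSApply(doc, "//thead/tr[1]/th[text()]", xmlGetAttr, "colspan")
)

# Repeat each unit name across the columns it spans, so the result lines up
# with the data columns after "Period".
unit_per_column <- rep(units, times = spans)
```

With a real file, htmlParse("path/to/file.html") would replace the asText call, and unit_per_column could be pasted onto the column names that readHTMLTable returns, leaving the first "Period" column alone.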

Upvotes: 1
