user1389960

Reputation: 433

Extract HTML table immediately following specified text

I am trying to scrape an HTML table from a webpage that also contains many tables I do not want. To identify the right one, I would like to select the first table that follows a specific phrase (the phrase is not inside the table; it appears in the surrounding text). Here is an example:

This is the table I am interested in:

library(XML)
url <- "http://www.sec.gov/Archives/edgar/data/1301063/000119312514133663/0001193125-14-133663.txt"
# The table of interest happens to be the 29th one on the page
readHTMLTable(url, trim = TRUE, header = FALSE, stringsAsFactors = FALSE)[29]

The criterion that I'd like to use to detect the table is that it is the first table that follows this word combination:

"safety, health, environmental and sustainability challenges"

library(RCurl)  # provides getURL()
html <- getURL(url, followlocation = TRUE)
doc <- htmlParse(html, asText = TRUE)
# Collect all visible text nodes, then confirm the phrase occurs in the document
text <- xpathSApply(doc, "//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]", xmlValue)
grep("safety, health, environmental and sustainability challenges", text, value = TRUE)

Upvotes: 1

Views: 63

Answers (1)

bgoldst

Reputation: 35324

I think this is what you're looking for:

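# Find the text node containing the phrase, then take the first <table> that follows it in document order
xpathSApply(doc, '//text()[contains(., "safety, health, environmental and sustainability challenges")]/following::table[1]')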
## <table cellspacing="0" cellpadding="0" width="100%" border="0" style="BORDER-COLLAPSE:COLLAPSE" align="center">
##   <tr><td width="48%"/>
## <td valign="bottom" width="12%"/>
## <td/>
## <td/>
## <td/>
## <td valign="bottom" width="12%"/>
## <td/>
## <td/>
## <td/>
## <td valign="bottom" width="12%"/>
## <td/>
## <td/>
## <td/>
## <td valign="bottom" width="12%"/>
## <td/>
## <td/>
## <td/></tr>
##   <tr><td valign="bottom" nowrap="nowrap" align="center" style="border-bottom:1px solid #000000"> <p style="margin-top:0px;margin-bottom:1px" align="center"><font style="font-family:Times New Roman" size="1"><b>Name</b></font></p></td>
## <td valign="bottom"><font size="1">  </font></td>
## <td valign="bottom" colspan="2" nowrap="nowrap" align="center" style="border-bottom:1px solid #000000"><font style="font-family:Times New Roman" size="1"><b>Audit<br/>Committee</b></font></td>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom" colspan="2" nowrap="nowrap" align="center" style="border-bottom:1px solid #000000"><font style="font-family:Times New Roman" size="1"><b>Compensation<br/>Committee</b></font></td>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom" colspan="2" nowrap="nowrap" align="center" style="border-bottom:1px solid #000000"><font style="font-family:Times New Roman" size="1"><b>Nominating and<br/>Corporate<br/>Governance<br/>Committee</b></font></td>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom" colspan="2" nowrap="nowrap" align="center" style="border-bottom:1px solid #000000"><font style="font-family:Times New Roman" size="1"><b>Safety, Health,<br/>Environmental and<br/>Sustainability<br/>Committee</b></font></td>
## <td valign="bottom"><font size="1"> </font></td></tr>
##   <tr bgcolor="#cceeff"><td valign="top"> <p style="margin-left:1.00em; text-indent:-1.00em"><font style="font-family:Times New Roman" size="2">Kevin S. Crutchfield</font><font style="font-family:Times New Roman" size="1"><sup style="vertical-align:baseline; position:relative; bottom:.8ex">(1)</sup></font><font style="font-family:Times New Roman" size="2"/></p></td>
## <td valign="bottom"><font size="1">  </font></td>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"><font style="font-family:Times New Roman" size="2"> </font></td>
## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font></td>
## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2">  </font></td></tr>
##   <tr><td valign="top"> <p style="margin-left:1.00em; text-indent:-1.00em"><font style="font-family:Times New Roman" size="2">Angelo C. Brisimitzakis</font></p></td>
## <td valign="bottom"><font size="1">  </font></td>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"><font style="font-family:Times New Roman" size="2"> </font></td>
## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font></td>
## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2">  </font></td>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"><font style="font-family:Times New Roman" size="2"> </font></td>
## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font></td>
## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2">  </font></td>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"><font style="font-family:Times New Roman" size="2"> </font></td>
## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font></td>
## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2">  </font></td></tr>
##   <tr bgcolor="#cceeff"><td valign="top"> <p style="margin-left:1.00em; text-indent:-1.00em"><font style="font-family:Times New Roman" size="2">William J. Crowley, Jr.</font></p></td>
## <td valign="bottom"><font size="1">  </font></td>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"><font style="font-family:Times New Roman" size="2"> </font></td>
## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font></td>
## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2">  </font></td></tr>
##   <tr><td valign="top"> <p style="margin-left:1.00em; text-indent:-1.00em"><font style="font-family:Times New Roman" size="2">E. Linn Draper, Jr.</font></p></td>
## <td valign="bottom"><font size="1">  </font></td>
## <td valign="bottom"><font style="font-family:Times New Roman" size="2"> </font></td>
## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font></td>
## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2">  </font></td>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"><font style="font-family:Times New Roman" size="2"> </font></td>
## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font></td>
## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2">  </font></td>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"/></tr>
##   <tr bgcolor="#cceeff"><td valign="top"> <p style="margin-left:1.00em; text-indent:-1.00em"><font style="font-family:Times New Roman" size="2">Glenn A. Eisenberg</font><font style="font-family:Times New Roman" size="1"><sup style="vertical-align:baseline; position:relative; bottom:.8ex">(2)</sup></font><font style="font-family:Times New Roman" size="2"/></p></td>
## <td valign="bottom"><font size="1">  </font></td>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"><font style="font-family:Times New Roman" size="2"/><font style="font-family:Times New Roman" size="1"><sup style="vertical-align:baseline; position:relative; bottom:.8ex"/></font><font style="font-family:Times New Roman" size="2"> </font></td>
## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font><font style="font-family:Times New Roman" size="1"><sup style="vertical-align:baseline; position:relative; bottom:.8ex"/></font><font style="font-family:Times New Roman" size="2"/></td>
## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2"/><font style="font-family:Times New Roman" size="1"><sup style="vertical-align:baseline; position:relative; bottom:.8ex">(3)</sup></font><font style="font-family:Times New Roman" size="2"> </font></td>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"><font style="font-family:Times New Roman" size="2"> </font></td>
## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font></td>
## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2">  </font></td></tr>
##   <tr><td valign="top"> <p style="margin-left:1.00em; text-indent:-1.00em"><font style="font-family:Times New Roman" size="2">Deborah M. Fretz</font></p></td>
## <td valign="bottom"><font size="1">  </font></td>
## <td valign="bottom"><font style="font-family:Times New Roman" size="2"/><font style="font-family:Times New Roman" size="1"><sup style="vertical-align:baseline; position:relative; bottom:.8ex"/></font><font style="font-family:Times New Roman" size="2"> </font></td>
## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font><font style="font-family:Times New Roman" size="1"><sup style="vertical-align:baseline; position:relative; bottom:.8ex"/></font><font style="font-family:Times New Roman" size="2"/></td>
## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2"/><font style="font-family:Times New Roman" size="1"><sup style="vertical-align:baseline; position:relative; bottom:.8ex">(3)</sup></font><font style="font-family:Times New Roman" size="2"> </font></td>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"><font style="font-family:Times New Roman" size="2"> </font></td>
## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font></td>
## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2">  </font></td>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"/></tr>
##   <tr bgcolor="#cceeff"><td valign="top"> <p style="margin-left:1.00em; text-indent:-1.00em"><font style="font-family:Times New Roman" size="2">P. Michael Giftos</font></p></td>
## <td valign="bottom"><font size="1">  </font></td>
## <td valign="bottom"><font style="font-family:Times New Roman" size="2"> </font></td>
## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font></td>
## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2">  </font></td>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"><font style="font-family:Times New Roman" size="2"> </font></td>
## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font></td>
## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2">  </font></td>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"><font style="font-family:Times New Roman" size="2"/><font style="font-family:Times New Roman" size="1"><sup style="vertical-align:baseline; position:relative; bottom:.8ex"/></font><font style="font-family:Times New Roman" size="2"> </font></td>
## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font><font style="font-family:Times New Roman" size="1"><sup style="vertical-align:baseline; position:relative; bottom:.8ex"/></font><font style="font-family:Times New Roman" size="2"/></td>
## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2"/><font style="font-family:Times New Roman" size="1"><sup style="vertical-align:baseline; position:relative; bottom:.8ex">(3)</sup></font><font style="font-family:Times New Roman" size="2"> </font></td></tr>
##   <tr><td valign="top"> <p style="margin-left:1.00em; text-indent:-1.00em"><font style="font-family:Times New Roman" size="2">L. Patrick Hassey</font></p></td>
## <td valign="bottom"><font size="1">  </font></td>
## <td valign="bottom"><font style="font-family:Times New Roman" size="2"> </font></td>
## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font></td>
## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2">  </font></td>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"><font style="font-family:Times New Roman" size="2"> </font></td>
## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font></td>
## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2">  </font></td>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"/></tr>
##   <tr bgcolor="#cceeff"><td valign="top"> <p style="margin-left:1.00em; text-indent:-1.00em"><font style="font-family:Times New Roman" size="2">Joel Richards, III</font></p></td>
## <td valign="bottom"><font size="1">  </font></td>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"><font style="font-family:Times New Roman" size="2"/><font style="font-family:Times New Roman" size="1"><sup style="vertical-align:baseline; position:relative; bottom:.8ex"/></font><font style="font-family:Times New Roman" size="2"> </font></td>
## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font><font style="font-family:Times New Roman" size="1"><sup style="vertical-align:baseline; position:relative; bottom:.8ex"/></font><font style="font-family:Times New Roman" size="2"/></td>
## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2"/><font style="font-family:Times New Roman" size="1"><sup style="vertical-align:baseline; position:relative; bottom:.8ex">(3)</sup></font><font style="font-family:Times New Roman" size="2"> </font></td>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"/>
## <td valign="bottom"><font size="1"> </font></td>
## <td valign="bottom"><font style="font-family:Times New Roman" size="2"> </font></td>
## <td valign="bottom" align="right"><font style="font-family:Times New Roman" size="2">X</font></td>
## <td nowrap="nowrap" valign="bottom"><font style="font-family:Times New Roman" size="2">  </font></td></tr>
## </table>
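
The XPath above returns the raw table node rather than a data frame. As a rough sketch of one way to finish the job, the node can be passed to readHTMLTable(). The inline HTML below is a made-up stand-in for the real filing so the example runs without network access, and the extra `(...)[1]` wrapper (not part of the original answer) is an assumption that only the first occurrence of the phrase matters.

```r
library(XML)

# Hypothetical miniature page standing in for the SEC filing
html <- '<html><body>
<table><tr><td>unwanted</td></tr></table>
<p>Oversees safety, health, environmental and sustainability challenges.</p>
<table><tr><td>Name</td><td>Audit</td></tr><tr><td>A. Smith</td><td>X</td></tr></table>
<table><tr><td>later</td></tr></table>
</body></html>'
doc <- htmlParse(html, asText = TRUE)

phrase <- "safety, health, environmental and sustainability challenges"

# First text node containing the phrase, then the first table after it
xp <- sprintf('(//text()[contains(., "%s")])[1]/following::table[1]', phrase)
nodes <- getNodeSet(doc, xp)

# readHTMLTable() accepts a single table node as well as a URL or document
tbl <- readHTMLTable(nodes[[1]], header = FALSE, stringsAsFactors = FALSE)
print(tbl)
```

Note that contains() is evaluated per text node here, so a phrase that is split across several nodes (for example by inline markup) would not be matched by this expression.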

Upvotes: 2
