Reputation: 730
I am working with the DrugBank database and need help extracting specific text from the HTML below:
<table>
<tr>
<td>Text</td>
</tr>
<tr>
<th>ATC Codes</th>
<td>B01AC05
<ul class="atc-drug-tree">
<li><a data-no-turbolink="true" href="/atc/B01AC">B01AC — Platelet aggregation inhibitors excl. heparin</a></li>
<li><a data-no-turbolink="true" href="/atc/B01A">B01A — ANTITHROMBOTIC AGENTS</a></li>
<li><a data-no-turbolink="true" href="/atc/B01">B01 — ANTITHROMBOTIC AGENTS</a></li>
<li><a data-no-turbolink="true" href="/atc/B">B — BLOOD AND BLOOD FORMING ORGANS</a></li>
</ul>
</td>
</tr>
<tr>
<td>Text</td>
</tr>
</table>
I want the following text as my output, as a list object:
B01AC05
B01AC — Platelet aggregation inhibitors excl. heparin
B01A — ANTITHROMBOTIC AGENTS
B01 — ANTITHROMBOTIC AGENTS
B — BLOOD AND BLOOD FORMING ORGANS
I have tried the function below, but it's not working:
library(XML)
getATC <- function(id){
url <- "http://www.drugbank.ca/drugs/"
dburl <- paste(url, id, sep ="")
tables <- readHTMLTable(dburl, header = F)
table <- tables[['atc-drug-tree']]
table
}
ids <- c("DB00208", "DB00209")
ref <- apply(ids, 1, getATC)
NB: The URL can be used to see the actual page I want to parse; the HTML snippet I provided is just an example.
Thanks
Upvotes: 2
Views: 133
Reputation: 269491
Create the URL strings and sapply over them with the getDrugs function, which parses the HTML, extracts the root of the HTML tree, finds each ul node with the indicated class, and returns its parent's text (but only up to the first whitespace) followed by the text in each ./li/a grandchild:
library(XML)
getDrugs <- function(...) {
  doc <- htmlTreeParse(..., useInternalNodes = TRUE)
  xpathApply(xmlRoot(doc), "//ul[@class='atc-drug-tree']", function(node) {
    c(sub("\\s.*", "", xmlValue(xmlParent(node))),  # get text before 1st whitespace
      xpathSApply(node, "./li/a", xmlValue))        # get text in each ./li/a node
  })
}
ids <- c("DB00208", "DB00209")
urls <- paste0("http://www.drugbank.ca/drugs/", ids)
L <- sapply(urls, getDrugs)
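giving the following list (one component per URL and, within each, one component per atc-drug-tree list found at that URL). In this output the first element is the ATC code run together with the first tree code, presumably because no whitespace separates them in the parent's text on the live pages: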
> L
$`http://www.drugbank.ca/drugs/DB00208`
$`http://www.drugbank.ca/drugs/DB00208`[[1]]
[1] "B01AC05B01AC"
[2] "B01AC — Platelet aggregation inhibitors excl. heparin"
[3] "B01A — ANTITHROMBOTIC AGENTS"
[4] "B01 — ANTITHROMBOTIC AGENTS"
[5] "B — BLOOD AND BLOOD FORMING ORGANS"
$`http://www.drugbank.ca/drugs/DB00209`
$`http://www.drugbank.ca/drugs/DB00209`[[1]]
[1] "A03DA06A03DA"
[2] "A03DA — Synthetic anticholinergic agents in combination with analgesics"
[3] "A03D — ANTISPASMODICS IN COMBINATION WITH ANALGESICS"
[4] "A03 — DRUGS FOR FUNCTIONAL GASTROINTESTINAL DISORDERS"
[5] "A — ALIMENTARY TRACT AND METABOLISM"
$`http://www.drugbank.ca/drugs/DB00209`[[2]]
[1] "A03DA06A03DA"
[2] "G04BD — Drugs for urinary frequency and incontinence"
[3] "G04B — UROLOGICALS"
[4] "G04 — UROLOGICALS"
[5] "G — GENITO URINARY SYSTEM AND SEX HORMONES"
We could create a 5x3 matrix out of the above like this:
simplify2array(do.call(c, L))
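As a rough illustration of what that flattening does, here is a self-contained sketch on a small stand-in list. L_demo and its labels are made up for the example; they only mimic the shape of L (one component per URL, each holding one or more length-5 character vectors).

```r
# Hypothetical stand-in for L: two "URLs", the second one with two trees
L_demo <- list(
  url1 = list(c("a1", "a2", "a3", "a4", "a5")),
  url2 = list(c("b1", "b2", "b3", "b4", "b5"),
              c("c1", "c2", "c3", "c4", "c5"))
)

# Remove one level of nesting: a flat list of three character vectors
flat <- do.call(c, L_demo)

# Bind the equal-length vectors column-wise into a character matrix
m <- simplify2array(flat)
dim(m)  # 5 rows (levels) by 3 columns (trees)
```

This only works because every vector has the same length; with trees of different depths simplify2array() would hand back a list instead of a matrix.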
And here is a test using the input in the question:
Lines <- '<table>
<tr>
<td>Text</td>
</tr>
<tr>
<th>ATC Codes</th>
<td>B01AC05
<ul class="atc-drug-tree">
<li><a data-no-turbolink="true" href="/atc/B01AC">B01AC — Platelet aggregation inhibitors excl. heparin</a></li>
<li><a data-no-turbolink="true" href="/atc/B01A">B01A — ANTITHROMBOTIC AGENTS</a></li>
<li><a data-no-turbolink="true" href="/atc/B01">B01 — ANTITHROMBOTIC AGENTS</a></li>
<li><a data-no-turbolink="true" href="/atc/B">B — BLOOD AND BLOOD FORMING ORGANS</a></li>
</ul>
</td>
</tr>
<tr>
<td>Text</td>
</tr>
</table>'
getDrugs(Lines, asText = TRUE)
giving:
[[1]]
[1] "B01AC05"
[2] "B01AC — Platelet aggregation inhibitors excl. heparin"
[3] "B01A — ANTITHROMBOTIC AGENTS"
[4] "B01 — ANTITHROMBOTIC AGENTS"
[5] "B — BLOOD AND BLOOD FORMING ORGANS"
Upvotes: 2
Reputation: 2225
readHTMLTable is not working because it can't read the headers in tables 3 and 4.
url <- "http://www.drugbank.ca/drugs/DB00208"
doc <- htmlParse(readLines(url))
summary(doc)
$nameCounts
td a tr li th span div p strong img table ...
745 399 342 175 159 137 66 49 46 27 27
# these raise errors
readHTMLTable(doc)
readHTMLTable(doc, which=3)
# this works
readHTMLTable(doc, which=3, header=FALSE)
Also, the ATC codes are not within a nearby table tag, so you have to use XPath, as in the other answers here.
xpathSApply(doc, '//ul[@class="atc-drug-tree"]/*', xmlValue)
[1] "B01AC — Platelet aggregation inhibitors excl. heparin" "B01A — ANTITHROMBOTIC AGENTS"
[3] "B01 — ANTITHROMBOTIC AGENTS" "B — BLOOD AND BLOOD FORMING ORGANS"
xpathSApply(doc, '//ul[@class="atc-drug-tree"]/../node()[1]', xmlValue)
[1] "B01AC05"
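The two queries above can be combined so that each tree comes back as one vector with its code first. This is only a sketch: it assumes the code sits in the text node immediately before each ul (as in the snippet from the question), uses a shortened copy of that snippet with plain hyphens, and the names html, trees and atc are made up for the example.

```r
library(XML)

# Shortened copy of the question's snippet (assumed structure)
html <- '<table><tr><th>ATC Codes</th>
<td>B01AC05
<ul class="atc-drug-tree">
<li><a href="/atc/B01AC">B01AC - Platelet aggregation inhibitors excl. heparin</a></li>
<li><a href="/atc/B">B - BLOOD AND BLOOD FORMING ORGANS</a></li>
</ul>
</td></tr></table>'

doc <- htmlParse(html, asText = TRUE)

# One ul node per ATC tree on the page
trees <- getNodeSet(doc, "//ul[@class='atc-drug-tree']")

atc <- lapply(trees, function(node) {
  # Nearest text node before the ul; assumed to hold the ATC code
  code <- xpathSApply(node, "./preceding-sibling::text()[1]", xmlValue)
  # Code first, then the label of each li/a entry
  c(trimws(code), xpathSApply(node, "./li/a", xmlValue))
})
atc
```

Taking the preceding text node rather than the parent's whole text avoids depending on whitespace between the code and the first list entry.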
Upvotes: 0
Reputation: 6659
rvest makes web scraping pretty simple. Here's a solution using it.
library("rvest")
library("stringr")
your_html <- read_html('<table>
<tr>
<td>Text</td>
</tr>
<tr>
<th>ATC Codes</th>
<td>B01AC05
<ul class="atc-drug-tree">
<li><a data-no-turbolink="true" href="/atc/B01AC">B01AC — Platelet aggregation inhibitors excl. heparin</a></li>
<li><a data-no-turbolink="true" href="/atc/B01A">B01A — ANTITHROMBOTIC AGENTS</a></li>
<li><a data-no-turbolink="true" href="/atc/B01">B01 — ANTITHROMBOTIC AGENTS</a></li>
<li><a data-no-turbolink="true" href="/atc/B">B — BLOOD AND BLOOD FORMING ORGANS</a></li>
</ul>
</td>
</tr>
<tr>
<td>Text</td>
</tr>
</table>')
your_name <-
  your_html %>%
  html_nodes(xpath='//th[contains(text(), "ATC Codes")]/following-sibling::td') %>%
  html_text() %>%
  str_extract(".+(?=\n)")
list_elements <-
  your_html %>% html_nodes("li") %>% html_nodes("a") %>% html_text()
your_list <- list()
your_list[[your_name]] <- list_elements
> your_list
$B01AC05
[1] "B01AC — Platelet aggregation inhibitors excl. heparin"
[2] "B01A — ANTITHROMBOTIC AGENTS"
[3] "B01 — ANTITHROMBOTIC AGENTS"
[4] "B — BLOOD AND BLOOD FORMING ORGANS"
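The str_extract(".+(?=\n)") step keeps only the first line of the cell text, i.e. everything before the first newline. A base R sketch of the same idea, in case stringr is not available; cell_text is an assumed, shortened stand-in for what html_text() returns for that cell.

```r
# Assumed shape of the td text: the code, a newline, then the list entries
cell_text <- "B01AC05\n\nB01AC - Platelet aggregation inhibitors excl. heparin\n"

# Same lookahead pattern, matched with base R's PCRE engine;
# "." does not match a newline, so the match stops at the end of line 1
first_line <- regmatches(cell_text, regexpr(".+(?=\n)", cell_text, perl = TRUE))
first_line
```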
Upvotes: 3