Reputation: 259
I am trying to extract the contents between specific HTML tags, e.g.:
<dl class="search-advanced-list">
<dt>
<h2><a id="/advanced-search?intercept=adv&as-advanced=+documenttype%3Asource title:%22ADB%22&as-type=advanced" name="ADB">ADB</a></h2>
</dt>
<dd>Allgemeine deutsche Biographie. Under the auspices of the Historical Commission of the Royal Academy of Sciences. 56 vols. Leipzig: Duncker & Humblot. 1875–1912.</dd>
<dt>
<h2><a id="/advanced-search?intercept=adv&as-advanced=+documenttype%3Asource title:%22AMS%22&as-type=advanced" name="AMS">AMS</a></h2>
</dt>
<dd>American men of science. J. McKeen Cattell, ed. Editions 1–4, New York: 1906–27.</dd>
<dt>
<h2><a id="/advanced-search?intercept=adv&as-advanced=+documenttype%3Asource title:%22Abbott%2C+C.+C.+1861%22&as-type=advanced" name="Abbott__C__C__1861">Abbott, C. C. 1861</a></h2>
</dt>
<dd>Abbott, Charles Compton. 1861. Notes on the birds of the Falkland Islands. Ibis 3: 149–67.</dd>
...
</dl>
I plan to extract the contents between <h2> and </h2> and the contents between <dd> and </dd>. I searched Stack Overflow for similar questions but still cannot figure it out. Is there a simple way to do this in R?
Upvotes: 2
Views: 1718
Reputation: 270268
This creates a two-column matrix m whose first column holds the h2 values and whose second column holds the associated dd values. Since the question gives no information on the form of the input, we have assumed that the input is a string Lines, but the htmlTreeParse line can be changed appropriately if not. Try ?htmlTreeParse for more info.
library(XML)
doc <- htmlTreeParse(Lines, asText = TRUE, useInternalNodes = TRUE)
# "following::dd[1]" is evaluated relative to each h2 node, so it returns only the first dd after it;
# a path starting with "//" searches the whole document and would pair every h2 with every dd
f <- function(x) cbind(h2 = xmlValue(x), dd = xpathSApply(x, "following::dd[1]", xmlValue))
L <- xpathApply(doc, "//h2", f)
m <- do.call(rbind, L)
Here we display the h2 column and the first 10 characters of the dd column:
> cbind(h2 = m[,1], dd = substr(m[,2], 1, 10))
h2 dd
[1,] "ADB" "Allgemeine"
[2,] "AMS" "American m"
[3,] "Abbott, C. C. 1861" "Abbott, Ch"
This is the input used above:
Lines <- '<dl class="search-advanced-list">
<dt>
<h2><a id="/advanced-search?intercept=adv&as-advanced=+documenttype%3Asource title:%22ADB%22&as-type=advanced" name="ADB">ADB</a></h2>
</dt>
<dd>Allgemeine deutsche Biographie. Under the auspices of the Historical Commission of the Royal Academy of Sciences. 56 vols. Leipzig: Duncker & Humblot. 1875–1912.</dd>
<dt>
<h2><a id="/advanced-search?intercept=adv&as-advanced=+documenttype%3Asource title:%22AMS%22&as-type=advanced" name="AMS">AMS</a></h2>
</dt>
<dd>American men of science. J. McKeen Cattell, ed. Editions 1–4, New York: 1906–27.</dd>
<dt>
<h2><a id="/advanced-search?intercept=adv&as-advanced=+documenttype%3Asource title:%22Abbott%2C+C.+C.+1861%22&as-type=advanced" name="Abbott__C__C__1861">Abbott, C. C. 1861</a></h2>
</dt>
<dd>Abbott, Charles Compton. 1861. Notes on the birds of the Falkland Islands. Ibis 3: 149–67.</dd>
</dl>'
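If the HTML sits in a file rather than in a string, the parse call might be adapted along these lines. This is only a sketch: the temporary file and its one-entry sample content are assumptions made so the example is self-contained, and the two xpathSApply calls pair h2 and dd by position, which only holds when each dt is followed by exactly one dd.

```r
library(XML)

# Hypothetical input: write a one-entry sample to a temporary file
path <- tempfile(fileext = ".html")
writeLines('<dl><dt><h2><a name="ADB">ADB</a></h2></dt><dd>Allgemeine deutsche Biographie.</dd></dl>', path)

# Without asText = TRUE the first argument is treated as a file name
doc <- htmlTreeParse(path, useInternalNodes = TRUE)

# Collect the text of every h2 and every dd, then bind them side by side
h2 <- xpathSApply(doc, "//h2", xmlValue)
dd <- xpathSApply(doc, "//dd", xmlValue)
stopifnot(length(h2) == length(dd))  # positional pairing needs equal counts
m <- cbind(h2 = h2, dd = dd)
```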
Upvotes: 5
Reputation: 78832
Or, doing the scraping the proper way:
library(xml2)
library(rvest)
pg <- read_html("https://www.darwinproject.ac.uk/bibliography")
h2 <- html_text(html_nodes(pg, "dt > h2"))
head(h2)
## [1] "ADB" "AMS"
## [3] "Abbott, C. C. 1861" "Abich, O. H. W. 1841"
## [5] "Accum, Frederick. 1820" "Acevedo Moraga, Fernando. 1987"
dd <- html_text(html_nodes(pg, "dd"))
head(dd)
## [1] "Allgemeine deutsche Biographie. Under the auspices of the Historical Commission of the Royal Academy of Sciences. 56 vols. Leipzig: Duncker & Humblot. 1875–1912."
## [2] "American men of science. J. McKeen Cattell, ed. Editions 1–4, New York: 1906–27."
## [3] "Abbott, Charles Compton. 1861. Notes on the birds of the Falkland Islands. Ibis 3: 149–67."
## [4] "Abich, Otto Hermann Wilhelm. 1841. Geologische Betrachtungen über die vulkanischen Erscheinungen und Bildungen in Unter- und Mittel-Italien. Braunschweig."
## [5] "Accum, Frederick. 1820. A treatise on the art of brewing, exhibiting the London practice of brewing porter, brown stout, ale, table beer, and various other kinds of malt liquors. London: Longman, Hurst, Rees, Orme, and Brown."
## [6] "Acevedo Moraga, Fernando. 1987. La Escuela de Minas de la Serena. In La Serena University, edited by Claudo Canut de Bon: 1–18. Chile."
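The two vectors can then be combined into one table. The sketch below does this on a small inline snippet (an assumption, shortened from the question's HTML, so that it runs without network access); the data frame name bib is made up for illustration, and the positional pairing assumes every dt has exactly one dd.

```r
library(xml2)
library(rvest)

# Inline sample standing in for the live page (shortened from the question)
html <- '<dl class="search-advanced-list">
<dt><h2><a name="ADB">ADB</a></h2></dt>
<dd>Allgemeine deutsche Biographie.</dd>
<dt><h2><a name="AMS">AMS</a></h2></dt>
<dd>American men of science.</dd>
</dl>'

pg <- read_html(html)
h2 <- html_text(html_nodes(pg, "dt > h2"))
dd <- html_text(html_nodes(pg, "dd"))

# Pairing by position is only valid if the counts match
stopifnot(length(h2) == length(dd))
bib <- data.frame(h2 = h2, dd = dd, stringsAsFactors = FALSE)
```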
I feel compelled to include a snippet from their ToS:
Subject to statutory allowances, extracts of material from the site may be accessed, downloaded and printed for your personal and non-commercial use and you may draw the attention of others within your organisation to material posted on the site. You may not:
- use any part of the material on the site for direct or indirect commercial purposes or advantage without obtaining a licence to do so from the University or its licensors
- you may not modify or alter the paper or digital copies of any material printed off or downloaded in any way
- sell, resell, license, transfer, transmit, display in any form, perform, hire, lease or loan any content in whole or in part printed or downloaded from the site
- systematically extract and/or re-utilise substantial parts of the content or material on the site
- create and/or publish your own database that features substantial parts of this site.
If you print, copy, download or use any part of the site in breach of these terms of use, your right to use the site will cease immediately and you must at the option of the University return or destroy any copies of the material you have made.
Upvotes: 3
Reputation: 992
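# Matches any opening, closing, or self-closing tag, including its attributes
htmlpattern <- "</?\\w+((\\s+\\w+(\\s*=\\s*(?:\".*?\"|'.*?'|[^'\">\\s]+))?)+\\s*|\\s*)/?>"
# Replace each tag with the empty string so only the text between tags remains;
# perl = TRUE makes sure the lazy quantifiers and the (?: ) group are read as PCRE
plain.text <- gsub(htmlpattern, "", txt, perl = TRUE)
cat(plain.text)
Note: txt is a character string containing the HTML. This strips every tag, so the h2 and dd text comes back as one plain-text string rather than as separate vectors.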
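If a regex approach is preferred, the contents of one specific tag can be pulled out with base R alone. This is a rough sketch, not a robust HTML parser: the helper name between_tags and the sample txt are made up for illustration, and it assumes the tag of interest is not nested inside itself.

```r
# Sample input (an assumption, shortened from the question's HTML)
txt <- '<dt><h2><a name="ADB">ADB</a></h2></dt><dd>Allgemeine deutsche Biographie.</dd><dt><h2><a name="AMS">AMS</a></h2></dt><dd>American men of science.</dd>'

# Hypothetical helper: return the text found inside each <tag>...</tag> pair
between_tags <- function(x, tag) {
  # (?s) lets "." span newlines; .*? stops at the first closing tag
  pattern <- sprintf("(?s)<%s\\b[^>]*>(.*?)</%s>", tag, tag)
  hits <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
  # Keep only the captured inner part, then drop any nested tags such as <a>
  inner <- sub(pattern, "\\1", hits, perl = TRUE)
  gsub("<[^>]+>", "", inner)
}

h2 <- between_tags(txt, "h2")
dd <- between_tags(txt, "dd")
```

Regexes like this break easily on irregular markup, so a real parser (as in the other answers) is usually the safer choice.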
Upvotes: 1