Reputation: 259

Extract contents within html tags using R

I am trying to extract the contents between specific HTML tags, e.g.:

<dl class="search-advanced-list">
<dt>
<h2><a id="/advanced-search?intercept=adv&amp;as-advanced=+documenttype%3Asource title:%22ADB%22&amp;as-type=advanced" name="ADB">ADB</a></h2>
</dt>
<dd>Allgemeine deutsche Biographie. Under the auspices of the Historical Commission of the Royal Academy of Sciences. 56 vols. Leipzig: Duncker &amp; Humblot. 1875&#8211;1912.</dd>
<dt>
<h2><a id="/advanced-search?intercept=adv&amp;as-advanced=+documenttype%3Asource title:%22AMS%22&amp;as-type=advanced" name="AMS">AMS</a></h2>
</dt>
<dd>American men of science. J. McKeen Cattell, ed. Editions 1&#8211;4, New York: 1906&#8211;27.</dd>
<dt>
<h2><a id="/advanced-search?intercept=adv&amp;as-advanced=+documenttype%3Asource title:%22Abbott%2C+C.+C.+1861%22&amp;as-type=advanced" name="Abbott__C__C__1861">Abbott, C. C. 1861</a></h2>
</dt>
<dd>Abbott, Charles Compton. 1861. Notes on the birds of the Falkland Islands. Ibis 3: 149&#8211;67.</dd>
...
</dl>

I plan to extract the contents within <h2> </h2> and the contents within <dd> and </dd>. I searched Stack Overflow for similar questions but still cannot figure it out. Is there a simple way to do this in R?

Upvotes: 2

Answers (3)

G. Grothendieck

Reputation: 270268

This creates a two-column matrix m whose first column holds each h2 value and whose second column holds the associated dd value. Since the question does not say what form the input takes, we assume it is a character string Lines; if it is not, change the htmlTreeParse line accordingly. See ?htmlTreeParse for more information.

library(XML)
doc <- htmlTreeParse(Lines, asText = TRUE, useInternalNodes = TRUE)

# "//dd" would match every dd in the document and pair each h2 with all of them;
# "following::dd[1]" selects only the first dd that follows the current h2.
f <- function(x) cbind(h2 = xmlValue(x), dd = xpathSApply(x, "following::dd[1]", xmlValue))
L <- xpathApply(doc, "//h2", f)
m <- do.call(rbind, L)

Here we display the h2 column and the first 10 characters of the dd column:

> cbind(h2 = m[,1], dd = substr(m[,2], 1, 10))

     h2                   dd          
[1,] "ADB"                "Allgemeine"
[2,] "AMS"                "American m"
[3,] "Abbott, C. C. 1861" "Abbott, Ch"

This is the input used above:

Lines <- '<dl class="search-advanced-list">
<dt>
<h2><a id="/advanced-search?intercept=adv&amp;as-advanced=+documenttype%3Asource title:%22ADB%22&amp;as-type=advanced" name="ADB">ADB</a></h2>
</dt>
<dd>Allgemeine deutsche Biographie. Under the auspices of the Historical Commission of the Royal Academy of Sciences. 56 vols. Leipzig: Duncker &amp; Humblot. 1875&#8211;1912.</dd>
<dt>
<h2><a id="/advanced-search?intercept=adv&amp;as-advanced=+documenttype%3Asource title:%22AMS%22&amp;as-type=advanced" name="AMS">AMS</a></h2>
</dt>
<dd>American men of science. J. McKeen Cattell, ed. Editions 1&#8211;4, New York: 1906&#8211;27.</dd>
<dt>
<h2><a id="/advanced-search?intercept=adv&amp;as-advanced=+documenttype%3Asource title:%22Abbott%2C+C.+C.+1861%22&amp;as-type=advanced" name="Abbott__C__C__1861">Abbott, C. C. 1861</a></h2>
</dt>
<dd>Abbott, Charles Compton. 1861. Notes on the birds of the Falkland Islands. Ibis 3: 149&#8211;67.</dd>
</dl>'

Upvotes: 5

hrbrmstr

Reputation: 78832

Or, doing the scraping the proper way:

library(xml2)
library(rvest)

pg <- read_html("https://www.darwinproject.ac.uk/bibliography")

h2 <- html_text(html_nodes(pg, "dt > h2"))
head(h2)
## [1] "ADB"                            "AMS"                           
## [3] "Abbott, C. C. 1861"             "Abich, O. H. W. 1841"          
## [5] "Accum, Frederick. 1820"         "Acevedo Moraga, Fernando. 1987"

dd <- html_text(html_nodes(pg, "dd"))
head(dd)
## [1] "Allgemeine deutsche Biographie. Under the auspices of the Historical Commission of the Royal Academy of Sciences. 56 vols. Leipzig: Duncker & Humblot. 1875–1912."                                                                
## [2] "American men of science. J. McKeen Cattell, ed. Editions 1–4, New York: 1906–27."                                                                                                                                                 
## [3] "Abbott, Charles Compton. 1861. Notes on the birds of the Falkland Islands. Ibis 3: 149–67."                                                                                                                                       
## [4] "Abich, Otto Hermann Wilhelm. 1841. Geologische Betrachtungen über die vulkanischen Erscheinungen und Bildungen in Unter- und Mittel-Italien. Braunschweig."                                                                       
## [5] "Accum, Frederick. 1820. A treatise on the art of brewing, exhibiting the London practice of brewing porter, brown stout, ale, table beer, and various other kinds of malt liquors. London: Longman, Hurst, Rees, Orme, and Brown."
## [6] "Acevedo Moraga, Fernando. 1987. La Escuela de Minas de la Serena. In La Serena University, edited by Claudo Canut de Bon: 1–18. Chile."
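
The two vectors above are separate, so each heading still has to be matched to its description. A minimal sketch of one way to do that, assuming every <dt> is followed by exactly one <dd> (true for the snippet in the question, not verified for the whole page). The inline html string and the name refs are illustrative stand-ins, not part of the original answer, and the sample is parsed locally rather than fetched from the site:

```r
library(xml2)
library(rvest)

# Hypothetical inline sample with the same structure as the question's snippet
html <- '<dl class="search-advanced-list">
<dt><h2><a name="ADB">ADB</a></h2></dt>
<dd>Allgemeine deutsche Biographie.</dd>
<dt><h2><a name="AMS">AMS</a></h2></dt>
<dd>American men of science.</dd>
</dl>'

pg <- read_html(html)

h2 <- html_text(html_nodes(pg, "dt > h2"))
dd <- html_text(html_nodes(pg, "dd"))

# Pairing by position is only safe when the two vectors line up one-to-one
stopifnot(length(h2) == length(dd))

refs <- data.frame(h2 = h2, dd = dd, stringsAsFactors = FALSE)
```

If the length check fails, positional pairing is not reliable and each dd would need to be looked up relative to its own dt instead.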

I feel compelled to include a snippet from their ToS:

Subject to statutory allowances, extracts of material from the site may be accessed, downloaded and printed for your personal and non-commercial use and you may draw the attention of others within your organisation to material posted on the site. You may not:

use any part of the material on the site for direct or indirect commercial purposes or advantage without obtaining a licence to do so from the University or its licensors

you may not modify or alter the paper or digital copies of any material printed off or downloaded in any way

sell, resell, license, transfer, transmit, display in any form, perform, hire, lease or loan any content in whole or in part printed or downloaded from the site

systematically extract and/or re-utilise substantial parts of the content or material on the site

create and/or publish your own database that features substantial parts of this site.

If you print, copy, download or use any part of the site in breach of these terms of use, your right to use the site will cease immediately and you must at the option of the University return or destroy any copies of the material you have made.

Upvotes: 3

Navin Manaswi

Reputation: 992

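# Pattern matching any opening, closing, or self-closing HTML tag
htmlpattern <- "</?\\w+((\\s+\\w+(\\s*=\\s*(?:\".*?\"|'.*?'|[^'\">\\s]+))?)+\\s*|\\s*)/?>"
# Replace each tag with nothing; "\\1" would put the tag's attributes back into the text
plain.text <- gsub(htmlpattern, "", txt, perl = TRUE)
cat(plain.text)

Note: txt is a character string containing the HTML. This strips all tags and leaves only the text between them.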
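
Stripping every tag leaves the h2 and dd text mixed together. If the goal is to keep them apart using only base R, a rough sketch along the following lines may help. The helper extract_tag and the sample txt are hypothetical names introduced for illustration, and a regular expression is not a full HTML parser: this assumes the tags are well formed and not nested inside a tag of the same name.

```r
# Hypothetical sample input with the same structure as in the question
txt <- '<dt>
<h2><a name="ADB">ADB</a></h2>
</dt>
<dd>Allgemeine deutsche Biographie.</dd>
<dt>
<h2><a name="AMS">AMS</a></h2>
</dt>
<dd>American men of science.</dd>'

# Return the text between <tag ...> and </tag>, with any inner tags removed
extract_tag <- function(html, tag) {
  # (?s) lets "." match newlines; (.*?) captures the shortest possible content
  pattern <- sprintf("(?s)<%s\\b[^>]*>(.*?)</%s>", tag, tag)
  hits <- regmatches(html, gregexpr(pattern, html, perl = TRUE))[[1]]
  inner <- sub(pattern, "\\1", hits, perl = TRUE)
  trimws(gsub("<[^>]+>", "", inner))
}

h2 <- extract_tag(txt, "h2")
dd <- extract_tag(txt, "dd")
```

Character entities such as &amp;amp; or &amp;#8211; are left undecoded by this approach, which is one reason a real parser is usually the safer choice.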

Upvotes: 1
