user6583482
user6583482

Reputation: 23

R Parses incomplete text from webpages (HTML)

I am trying to parse the plain text from multiple scientific articles for subsequent text analysis. So far I use a R script by Tony Breyal based on the packages RCurl and XML. This works fine for all targeted journals, except for those published by http://www.sciencedirect.com. When I try to parse the articles from SD (and this is consistent for all tested journals I need to access from SD), the text object in R just stores the first part of the whole document in it. Unfortunately, I am not too familiar with html, but I think the problem should be in the SD html code, since it works in all other cases. I am aware that some journals are not open accessible, but I have access authorisations and the problems also occur in open access articles (check the example). This is the code from Github:

 htmlToText <- function(input, ...) {
###---PACKAGES ---###
 require(RCurl)
 require(XML)


###--- LOCAL FUNCTIONS ---###
# Determine how to grab html for a single input element
 evaluate_input <- function(input) {    
# if input is a .html file
if(file.exists(input)) {
  char.vec <- readLines(input, warn = FALSE)
  return(paste(char.vec, collapse = ""))
}

# if input is html text
if(grepl("</html>", input, fixed = TRUE)) return(input)

# if input is a URL, probably should use a regex here instead?
if(!grepl(" ", input)) {
  # downolad SSL certificate in case of https problem
  if(!file.exists("cacert.perm")) download.file(url="http://curl.haxx.se/ca/cacert.pem", destfile="cacert.perm")
  return(getURL(input, followlocation = TRUE, cainfo = "cacert.perm"))
}

# return NULL if none of the conditions above apply
return(NULL)
}

# convert HTML to plain text
convert_html_to_text <- function(html) {
doc <- htmlParse(html, asText = TRUE)
text <- xpathSApply(doc, "//text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]", xmlValue)
return(text)
}

# format text vector into one character string
collapse_text <- function(txt) {
return(paste(txt, collapse = " "))
 }

###--- MAIN ---###
# STEP 1: Evaluate input
html.list <- lapply(input, evaluate_input)

# STEP 2: Extract text from HTML
text.list <- lapply(html.list, convert_html_to_text)

# STEP 3: Return text
text.vector <- sapply(text.list, collapse_text)
return(text.vector)
}

This is now my code and an example article:

target <- "http://www.sciencedirect.com/science/article/pii/S1754504816300319"
temp.text <- htmlToText(target)

The unformatted text stops somewhere in the Method section:

DNA was extracted using the MasterPure™ Yeast DNA Purification Kit (Epicentre, Madison, Wisconsin, USA) following the manufacturer's instructions.

Any suggestions/ideas?

P.S. I also tried html_text based on rvest with the same outcome.

Upvotes: 2

Views: 210

Answers (1)

hrbrmstr
hrbrmstr

Reputation: 78832

You can prbly use your existing code and just add ?np=y to the end of the URL, but this is a bit more compact:

library(rvest)
library(stringi)

target <- "http://www.sciencedirect.com/science/article/pii/S1754504816300319?np=y"

pg <- read_html(target)
pg %>%
  html_nodes(xpath=".//div[@id='centerContent']//child::node()/text()[not(ancestor::script)][not(ancestor::style)][not(ancestor::noscript)][not(ancestor::form)]") %>% 
  stri_trim() %>% 
  paste0(collapse=" ") %>% 
  write(file="output.txt")

A bit of the output (total for that article was >80K):

 Fungal Ecology Volume 22 , August 2016, Pages 61–72        175394|| Species richness 
 influences wine ecosystem function through a dominant species Primrose J. Boynton a , , , 
 Duncan Greig a , b a  Max Planck Institute for Evolutionary Biology, Plön, 24306, Germany 
 b  The Galton Laboratory, Department of Genetics, Evolution, and Environment, University 
 College London, London, WC1E 6BT, UK Received 9 November 2015, Revised 27 March 2016, 
 Accepted 15 April 2016, Available online 1 June 2016 Corresponding editor: Marie Louise
 Davey Abstract Increased species richness does not always cause increased ecosystem function. 
 Instead, richness can influence individual species with positive or negative ecosystem effects. 
 We investigated richness and function in fermenting wine, and found that richness indirectly 
 affects ecosystem function by altering the ecological dominance of Saccharomyces cerevisiae . 
 While S. cerevisiae generally dominates fermentations, it cannot dominate extremely species-rich 
 communities, probably because antagonistic species prevent it from growing. It is also diluted 
 from species-poor communities, 

Upvotes: 1

Related Questions