Reputation: 13
I have an HTML file which consists of 5 different articles, and I would like to extract each of these articles separately in R and run some analysis per article. Each article starts with <doc>, ends with </doc>, and also has a document number. Example:
<doc>
<docno> NA123455-0001 </docno>
<docid> 1 </docid>
<p>
NASA one-year astronaut Scott Kelly speaks after coming home to Houston on
March 3, 2016. Behind Kelly,
from left to right: U.S. Second Lady Jill Biden; Kelly's identical in
brother, Mark;
John Holdren, Assistant to the President for Science and ...
</p>
</doc>
<doc>
<docno> KA25637-1215 </docno>
<docid> 65 </docid>
<date>
<p>
February 1, 2014, Sunday
</p>
</date>
<section>
<p>
WASHINGTON -- Former Republican presidential nominee Mitt Romney
is charging into the increasingly divisive 2016 GOP
White House sweepstakes Thursday with a harsh takedown of front-runner
Donald Trump, calling him a "phony" and exhorting fellow
</p>
</type>
</doc>
<doc>
<docno> JN1234567-1225 </docno>
<docid> 67 </docid>
<date>
<p>
March 5, 2003
</p>
</date>
<section>
<p>
SEOUL—New U.S.-led efforts to cut funding for North Korea's nuclear weapons
program through targeted
sanctions risk faltering because of Pyongyang's willingness to divert all
available resources to its
military, even at the risk of economic collapse ...
</p>
</doc>
I have read the file from the URL using the readLines()
function and combined all lines into a single string with
articles <- paste(articles, collapse = " ")
I would like to select the first article, which is between <doc>..</doc>, and assign it to article1, the second one to article2, and so on.
Could you please advise how to construct a function to select each of these articles separately?
Upvotes: 0
Views: 1993
Reputation: 43354
You could use strsplit
, which splits strings on whatever text or regex you give it. It returns a list with one item for each part of the string between occurrences of the splitting string, which you can then subset into different variables if you like. (You could use other regex functions as well, if you prefer.)
splitArticles <- strsplit(articles, '<doc>')
You'll still need to chop out the </doc>
tags (plus a lot of other cruft, if you just want the text), but it's a start.
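To make that concrete, here is a minimal sketch of the split plus cleanup, assuming articles holds the pasted file as a single string (as in the question):

```r
# `articles` is assumed to be the whole file collapsed into one string
parts <- strsplit(articles, "<doc>", fixed = TRUE)[[1]]
parts <- parts[parts != ""]            # drop the empty chunk before the first <doc>
parts <- sub("</doc>.*", "", parts)    # chop the closing tag and anything after it
article1 <- parts[1]
article2 <- parts[2]
```

From there you can strip the remaining <docno>/<docid>/<p> tags with further gsub() calls if you only want the text.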
A more typical way to do the same thing would be to use a package for html scraping/parsing. Using the rvest
package, you'd need something like
library(rvest)
read_html(articles) %>% html_nodes('doc') %>% html_text()
which will give you a character vector of the contents of the <doc>
tags. It may take more cleaning, especially if there is stray whitespace to strip. Picking your selector carefully for html_nodes
may help you avoid some of this; it looks like if you used p
instead of doc
, you'd be more likely to get just the text.
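A sketch of that approach which keeps the articles separated rather than flattening them, pulling only the <p> text per article (texts and ids are names introduced here for illustration; articles is again assumed to be the whole file as one string):

```r
library(rvest)

# parse the pasted file; unknown tags like <doc> are kept as ordinary nodes
page <- read_html(articles)
docs <- html_nodes(page, "doc")

# one character vector of paragraph text per article
texts <- lapply(docs, function(d) html_text(html_nodes(d, "p")))
ids   <- trimws(html_text(html_nodes(page, "docid")))
```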
Upvotes: 1
Reputation: 5993
The simplest solution is to use strsplit
:
art_list <- unlist(strsplit(s, "<doc>", fixed = TRUE))
art_list <- art_list[art_list != ""]  # drop the empty piece before the first <doc>
ids <- trimws(gsub(".*<docid>|</docid>.*", "", art_list))
for (i in seq_along(art_list)) {
  assign(paste("article", ids[i], sep = "_"),
         sub("</doc>.*", "", art_list[i]))
}
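If you'd rather not create separate variables with assign(), the same split fits in a named list, which is easier to loop over later (a sketch under the same assumption that s is the whole file as one string):

```r
art_list <- unlist(strsplit(s, "<doc>", fixed = TRUE))
art_list <- art_list[art_list != ""]
ids <- trimws(gsub(".*<docid>|</docid>.*", "", art_list))
articles_by_id <- setNames(as.list(sub("</doc>.*", "", art_list)),
                           paste0("article_", ids))
# articles_by_id[["article_1"]] is then the body of the first article
```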
Upvotes: 0