Reputation: 13
I have an HTML file which consists of 5 different articles, and I would like to extract each of these articles separately in R and run some analysis per article. Each article starts with <doc>, ends with </doc>, and also has a document number. Example:
<doc>
<docno> NA123455-0001 </docno>
<docid> 1 </docid>
<p>
NASA one-year astronaut Scott Kelly speaks after coming home to Houston on
March 3, 2016. Behind Kelly,
from left to right: U.S. Second Lady Jill Biden; Kelly's identical in
brother, Mark;
John Holdren, Assistant to the President for Science and ...
</p>
</doc>
<doc>
<docno> KA25637-1215 </docno>
<docid> 65 </docid>
<date>
<p>
February 1, 2014, Sunday
</p>
</date>
<section>
<p>
WASHINGTON -- Former Republican presidential nominee Mitt Romney
is charging into the increasingly divisive 2016 GOP
White House sweepstakes Thursday with a harsh takedown of front-runner
Donald Trump, calling him a "phony" and exhorting fellow
</p>
</type>
</doc>
<doc>
<docno> JN1234567-1225 </docno>
<docid> 67 </docid>
<date>
<p>
March 5, 2003
</p>
</date>
<section>
<p>
SEOUL—New U.S.-led efforts to cut funding for North Korea's nuclear weapons
program through targeted
sanctions risk faltering because of Pyongyang's willingness to divert all
available resources to its
military, even at the risk of economic collapse ...
</p>
</doc>
I have read the file from the URL using the readLines()
function and combined all lines into a single string with
articles <- paste(articles, collapse = " ")
I would like to select the first article, which is between <doc>..</doc>, and assign it to article1, the second one to article2, and so on.
Could you please advise how to construct a function to select each of these articles separately?
Upvotes: 0
Views: 1993
Reputation: 43354
You could use strsplit
, which splits strings on whatever text or regex you give it. It returns a list with one item for each part of the string between occurrences of the splitting string, which you can then subset into different variables if you like. (You could use other regex functions as well, if you prefer.)
splitArticles <- strsplit(articles, '<doc>')
You'll still need to chop out the </doc>
tags (plus a lot of other cruft, if you just want the text), but it's a start.
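To make that concrete, here is a minimal sketch of the split plus cleanup, assuming articles holds the pasted file as a single string (as in the question):

```r
# `articles` is assumed to be the whole file collapsed into one string
parts <- strsplit(articles, "<doc>", fixed = TRUE)[[1]]
parts <- parts[parts != ""]            # drop the empty chunk before the first <doc>
parts <- sub("</doc>.*", "", parts)    # chop the closing tag and anything after it
article1 <- parts[1]
article2 <- parts[2]
```

From there you can strip the remaining <docno>/<docid>/<p> tags with further gsub() calls if you only want the text.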
A more typical way to do the same thing would be to use a package for html scraping/parsing. Using the rvest
package, you'd need something like
library(rvest)
read_html(articles) %>% html_nodes('doc') %>% html_text()
which will give you a character vector of the contents of the <doc>
tags. It may take more cleaning, especially if there is stray whitespace to strip. Picking your selector carefully for html_nodes
may help you avoid some of this; it looks like if you used p
instead of doc
, you'd be more likely to get just the text.
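A sketch of that approach which keeps the articles separated rather than flattening them, pulling only the <p> text per article (texts and ids are names introduced here for illustration; articles is again assumed to be the whole file as one string):

```r
library(rvest)

# parse the pasted file; unknown tags like <doc> are kept as ordinary nodes
page <- read_html(articles)
docs <- html_nodes(page, "doc")

# one character vector of paragraph text per article
texts <- lapply(docs, function(d) html_text(html_nodes(d, "p")))
ids   <- trimws(html_text(html_nodes(page, "docid")))
```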
Upvotes: 1
Reputation: 5993
The simplest solution is to use strsplit
:
art_list <- unlist(strsplit(s, "<doc>", fixed = TRUE))
art_list <- art_list[art_list != ""]  # drop the empty piece before the first <doc>
ids <- trimws(gsub(".*<docid>|</docid>.*", "", art_list))
for (i in seq_along(art_list)) {
  assign(paste("article", ids[i], sep = "_"),
         sub("</doc>.*", "", art_list[i]))
}
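If you'd rather not create separate variables with assign(), the same split fits in a named list, which is easier to loop over later (a sketch under the same assumption that s is the whole file as one string):

```r
art_list <- unlist(strsplit(s, "<doc>", fixed = TRUE))
art_list <- art_list[art_list != ""]
ids <- trimws(gsub(".*<docid>|</docid>.*", "", art_list))
articles_by_id <- setNames(as.list(sub("</doc>.*", "", art_list)),
                           paste0("article_", ids))
# articles_by_id[["article_1"]] is then the body of the first article
```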
Upvotes: 0