spindoctor

Reputation: 1895

Returning text between a starting and ending regular expression

I am working on a regular expression to extract some text from files downloaded from a newspaper database. The files are mostly well formatted: the full text of each article starts with a well-defined phrase, ^Full text:. However, the end of the full text is not demarcated. The best I can figure is that the full text ends where a variety of metadata tags begin, such as Subject:, CREDIT:, or Credit:.

I can certainly find the start of the article, but I am having a great deal of difficulty selecting the text between the start and the end of the full text.

This is complicated by two factors. First, the ending string varies, although I feel like I could settle on something like '^[:alnum:]{5,}: ' and that would capture the ending. Second, similar tags appear before the start of the full text. How do I get R to return only the text between the Full text regex and the ending regex?

test<-c('Document 1', 'Article title', 'Author: Author Name', 'https://a/url', 'Abstract: none', 'Full text: some article text that I need to capture','the second line of the article that I need to capture', 'Subject: A subject', 'Publication: Publication', 'Location: A country')

test2<-c('Document 2', 'Article title', 'Author: Author Name', 'https://a/url', 'Abstract: none', 'Full text: some article text that I need to capture','the second line of the article that I need to capture', 'Credit: A subject', 'Publication: Publication', 'Location: A country')

My current attempt is here:

test[(grep('Full text:', test)+1):grep('^[:alnum:]{5,}: ', test)]

Thank you.

Upvotes: 0

Views: 103

Answers (1)

IceCreamToucan

Reputation: 28675

This just searches for the element matching 'Full text:', then for the next element after it that contains ':', and returns everything from the first up to, but not including, the second.

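get_text <- function(x){
  # index of the line where the article body starts (first match only)
  start <- grep('Full text:', x)[1]
  # indices of every line containing a colon (candidate metadata tags)
  end <- grep(':', x)
  # the first such line after the start, minus one, is the last body line
  end <- end[which(end > start)[1]] - 1
  x[start:end]
}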

get_text(test)
# [1] "Full text: some article text that I need to capture"  
# [2] "the second line of the article that I need to capture"
get_text(test2)
# [1] "Full text: some article text that I need to capture"  
# [2] "the second line of the article that I need to capture"
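
One caveat: matching on a bare ':' would cut the article short if a line of the body itself contains a colon. A possible refinement is to end on the tag pattern from the question instead. Note that the POSIX class has to sit inside a bracket expression, so it is written '^[[:alnum:]]{5,}: ' rather than '^[:alnum:]{5,}: '. The sketch below is only an illustration under that assumption: get_full_text, end_pattern and sample_doc are hypothetical names, and it also assumes one 'Full text:' line per document and strips that prefix from the first returned line.

```r
# Hypothetical helper: return the article body between the 'Full text:' line
# and the next line that looks like a metadata tag (e.g. 'Subject: ').
get_full_text <- function(x, end_pattern = '^[[:alnum:]]{5,}: ') {
  # first line that starts with 'Full text:'; NA if there is none
  start <- grep('^Full text:', x)[1]
  if (is.na(start)) return(character(0))

  # only consider tag lines that come after the start line
  after <- seq_along(x) > start
  end <- which(after & grepl(end_pattern, x))[1]

  # if no tag follows, keep everything up to the end of the vector
  end <- if (is.na(end)) length(x) else end - 1

  body <- x[start:end]
  # drop the 'Full text:' prefix from the first line
  body[1] <- sub('^Full text:[[:space:]]*', '', body[1])
  body
}

# Made-up sample in the same shape as the vectors in the question
sample_doc <- c('Document 1', 'Author: Author Name', 'Abstract: none',
                'Full text: some article text that I need to capture',
                'the second line of the article that I need to capture',
                'Credit: A subject', 'Publication: Publication')

get_full_text(sample_doc)
# [1] "some article text that I need to capture"
# [2] "the second line of the article that I need to capture"
```

Because the ending pattern is only tested on lines after the start, the similar tags that appear before 'Full text:' (Author:, Abstract:) are ignored. If the metadata tags in the real files can be shorter than five characters or contain spaces, end_pattern would need adjusting.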

Upvotes: 1
