SLE
SLE

Reputation: 85

How to break a corpus into paragraphs using custom delimiters

I am scraping the New york Times webpages to do some natural language processing on it, I want to split the webpage into paragraphs when using corpus in order to do frequency counts on words that appear in paragraphs which also contain key words or phrases .

The below works with sentences but the paragraphs are donated by a • in NYT, so I need to replace this into how corpus reads paragraphs - anyone got any idea's? I have tried gsub("•","/n",...) and gsub("•","/r/n") but this didn't work.

If anyone knows how to do this all in tm corpus's rather than having to switch between quanteda and TM that would save some code.

 website<-read_html("https://www.nytimes.com/2017/01/03/briefing/asia-australia-briefing.html") #Read URL
     


  #Obtain any text with the paragraph Html deliminator 
  text<-website%>%
    html_nodes("p") %>%
    html_text() %>% as.character()
  
  #Collapse the string as it is currently text[1]=para1 and text[2]= para 2
  text<- str_c(text,collapse=" ")


data_corpus_para <- 
  corpus_reshape(corpus((text),to="paragraphs"))


data_corpus_para <-tolower(data_corpus_para )


containstarget <- 
  stringr::str_detect(texts(data_corpus_para ), "pull out of peace talks") #Random string in only one of the paragraphs to proof concept

#Filter for the para's that only contain the sentence above
data_corpus_para <- 
  corpus_subset(data_corpus_para , containstarget)                 

data_corpus_para <-corpus_reshape(data_corpus_para , to = "documents")


#There are quanteda corpus and TM Corpuses. And so I have to convert to a dataframe and then make back into a vcorupus.. this is very messy

data_corpus_para <-quanteda::convert(data_corpus_para )
data_corpus_para_VCorpus<-tm::VCorpus(tm::VectorSource(data_corpus_para$text))

dt.dtm = tm::DocumentTermMatrix(data_corpus_para_VCorpus)
tm::findFreqTerms(dt.dtm, 1)

Upvotes: 0

Views: 497

Answers (1)

Ken Benoit
Ken Benoit

Reputation: 14902

If the paragraph delimiter is "•", then you can use corpus_segment():

library("quanteda")
## Package version: 3.0.0
## Unicode version: 10.0
## ICU version: 61.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.

txt <- "
• This is the first paragraph.
This is still the first paragraph.
• Here is the third paragraph.  Last sentence"

corpus(txt) %>%
  corpus_segment(pattern = "•")
## Corpus consisting of 2 documents and 1 docvar.
## text1.1 :
## "This is the first paragraph. This is still the first paragra..."
## 
## text1.2 :
## "Here is the third paragraph.  Last sentence"

Created on 2021-04-10 by the reprex package (v1.0.0)

Upvotes: 1

Related Questions