Puneet Mathur

Reputation: 79

R programming: need a string-based solution for splitting a huge text

I have a text file with sample text like the one below, all in lowercase:

"venezuela probes ex-oil czar ramirez over alleged graft scheme
caracas/houston (reuters) - venezuela is investigating rafael ramirez, a 
once powerful oil minister and former head of state oil company pdvsa, in 
connection with an alleged $4.8 billion vienna-based corruption scheme, the 
state prosecutor's office announced on friday.


5.5 hours ago
— reuters


amazon ordered not to pull in customers who can't spell `birkenstock'
a german court has ordered amazon not to lure internet shoppers to its 
online marketplace when they mistakenly search for "brikenstock", 
"birkenstok", "bierkenstock" and other variations in google.


6 hours ago
— business standard"

What I need in R is to get these two pieces of text, separated out.

The first piece of text would correspond with the text1 variable and the second piece of text should correspond with the text2 variable.

Please remember I have many text-like paragraphs in this file. The solution would have to work for, say, 100,000 texts.

The only delimiter I could think of is "—", but splitting on it loses the source of the information, such as "reuters" or "business standard". I need that as well.

Would you know how to accomplish this in R?

Upvotes: 0

Views: 66

Answers (2)

Arani

Reputation: 823

Here's what I could do. I do not like the loop in this, but I could not vectorize it. Hopefully this answer will at least serve as a starting point for other better answers.

Assumption: all publisher names are preceded by "— ".

# read.delim2 skips blank lines by default, so V1 holds one non-empty line per row
TEXT <- read.delim2("C:/Users/Arani.das/Desktop/TEXT.txt", header=FALSE, quote="", stringsAsFactors=FALSE)
TEXT$Publisher <- grepl("— ", TEXT$V1) # TRUE on publisher lines
TEXT$V1 <- gsub("^\\s+|\\s+$", "", TEXT$V1) # trim whitespace at the start and end of each line
TEXT$FLAG <- 1 # grouping variable
for(i in 2:nrow(TEXT)){
  # a new group starts on the line after a publisher line
  if(TEXT$Publisher[i-1]){TEXT$FLAG[i] <- TEXT$FLAG[i-1]+1}else{TEXT$FLAG[i] <- TEXT$FLAG[i-1]}
} # grouping entries
TEXT <- data.table::data.table(TEXT, key="FLAG")
# per group: last line is the publisher, the one before it the time, the rest is the news text
TEXT2 <- TEXT[, list(News=paste0(V1[1:(length(V1)-2)], collapse=" "), Time=V1[length(V1)-1], Publisher=V1[length(V1)]), by="FLAG"]

Output:

FLAG News          Time          Publisher
1    venezuela...  5.5 hours ago — reuters
2    amazon...     6 hours ago   — business standard
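
The grouping loop can also be written without a loop. The following is only a sketch: the V1 vector is made-up sample data standing in for TEXT$V1 (blank lines already removed), and "\u2014" is the escaped form of the "—" character. Each line's group number is one plus the count of publisher lines strictly before it.

```r
# Hypothetical sample lines standing in for TEXT$V1 (not the real file)
V1 <- c("headline one", "body of story one", "5.5 hours ago", "\u2014 reuters",
        "headline two", "body of story two", "6 hours ago", "\u2014 business standard",
        "headline three", "3 hours ago", "\u2014 some source")

# TRUE on publisher lines ("\u2014 " is the special dash followed by a space)
Publisher <- grepl("\u2014 ", V1, fixed = TRUE)

# Shift the indicator down by one line and accumulate:
# the counter increases on the line right after each publisher line
FLAG <- cumsum(c(FALSE, head(Publisher, -1))) + 1L
```

FLAG can then serve as the by-group in the data.table call, in place of the column filled by the loop.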

Upvotes: 1

IRTFM

Reputation: 263332

Read the text from the file with readLines, then split on the shifted cumsum of the occurrences of that special dash in front of the publisher:

 Lines <- readLines("Lines.txt")  # file in the working directory, see getwd()
 split(Lines, cumsum(c(0, head(grepl("—", Lines),-1))) )
#--------------
$`0`
[1] "venezuela probes ex-oil czar ramirez over alleged graft scheme"              
[2] "caracas/houston (reuters) - venezuela is investigating rafael ramirez, a "   
[3] "once powerful oil minister and former head of state oil company pdvsa, in "  
[4] "connection with an alleged $4.8 billion vienna-based corruption scheme, the "
[5] "state prosecutor's office announced on friday."                              
[6] "5.5 hours ago"                                                               
[7] "— reuters"                                                                   

$`1`
[1] "amazon ordered not to pull in customers who can't spell `birkenstock'"  
[2] "a german court has ordered amazon not to lure internet shoppers to its "
[3] "online marketplace when they mistakenly search for \"brikenstock\", "   
[4] "\"birkenstok\", \"bierkenstock\" and other variations in google."       
[5] "6 hours ago"                                                            
[6] "— business standard'" 

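It's not a regular "-". It's a "—". Also note that readLines itself keeps blank lines, so if the file contains any, drop them first (for example with Lines <- Lines[nzchar(Lines)]) to get the output shown above.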
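
To get one row per item with the source kept, the split result can be collapsed into a data frame. This is a minimal sketch, assuming every item ends with a time line followed by a publisher line that starts with "—" (written "\u2014" in the code), and that the file ends with a publisher line. The sample Lines vector, the helper last() and the column names News, Time and Publisher are made up for illustration.

```r
# Hypothetical sample input; in practice this would come from readLines()
Lines <- c(
  "venezuela probes ex-oil czar ramirez over alleged graft scheme",
  "caracas/houston (reuters) - venezuela is investigating rafael ramirez",
  "",
  "5.5 hours ago",
  "\u2014 reuters",
  "",
  "amazon ordered not to pull in customers who can't spell `birkenstock'",
  "a german court has ordered amazon not to lure internet shoppers",
  "6 hours ago",
  "\u2014 business standard"
)

Lines <- trimws(Lines)
Lines <- Lines[nzchar(Lines)]            # drop blank lines explicitly

is_pub <- grepl("^\u2014", Lines)        # publisher lines start with the special dash
groups <- split(Lines, cumsum(c(0, head(is_pub, -1))))

# Helper (hypothetical name): the line k positions before the end of a group
last <- function(g, k = 0) g[length(g) - k]

items <- data.frame(
  News      = vapply(groups, function(g) paste(head(g, -2), collapse = " "),
                     character(1), USE.NAMES = FALSE),
  Time      = vapply(groups, last, character(1), k = 1, USE.NAMES = FALSE),
  Publisher = sub("^\u2014\\s*", "", vapply(groups, last, character(1), USE.NAMES = FALSE)),
  stringsAsFactors = FALSE
)
```

Here items$News[1] and items$News[2] play the role of text1 and text2 from the question, and the Publisher column keeps the source with the dash stripped. A group with fewer than two lines would make vapply fail, which is why the sketch relies on the stated assumptions.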

Upvotes: 6
