R programming: need a string-based solution for splitting a huge text

Question

I have a text file with sample text like the one below, all in lowercase:

"venezuela probes ex-oil czar ramirez over alleged graft scheme
caracas/houston (reuters) - venezuela is investigating rafael ramirez, a 
once powerful oil minister and former head of state oil company pdvsa, in 
connection with an alleged $4.8 billion vienna-based corruption scheme, the 
state prosecutor's office announced on friday.


5.5 hours ago
— reuters


amazon ordered not to pull in customers who can't spell `birkenstock'
a german court has ordered amazon not to lure internet shoppers to its 
online marketplace when they mistakenly search for "brikenstock", 
"birkenstok", "bierkenstock" and other variations in google.


6 hours ago
— business standard"

What I need in R is to separate these two pieces of text.

The first piece of text should end up in the text1 variable and the second piece in the text2 variable.

Please keep in mind that this file contains many such paragraphs. The solution has to work for, say, 100,000 texts.

The only thing I could think of to use as a delimiter is "—", but splitting on it loses the source of the information, such as "reuters" or "business standard". I need that as well.

Would you know how to accomplish this in R?

IRTFM · Accepted Answer

Read the text from the file with readLines and then split on the shifted cumsum of the occurrences of that special dash in front of the publisher:

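 Lines <- readLines("Lines.txt")  # file in the working directory, see getwd()
 # grepl() flags the publisher lines; shifting by one keeps each publisher line with its own article
 split(Lines, cumsum(c(0, head(grepl("—", Lines),-1))) )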
#--------------
$`0`
[1] "venezuela probes ex-oil czar ramirez over alleged graft scheme"              
[2] "caracas/houston (reuters) - venezuela is investigating rafael ramirez, a "   
[3] "once powerful oil minister and former head of state oil company pdvsa, in "  
[4] "connection with an alleged $4.8 billion vienna-based corruption scheme, the "
[5] "state prosecutor's office announced on friday."                              
[6] "5.5 hours ago"                                                               
[7] "— reuters"                                                                   

$`1`
[1] "amazon ordered not to pull in customers who can't spell `birkenstock'"  
[2] "a german court has ordered amazon not to lure internet shoppers to its "
[3] "online marketplace when they mistakenly search for \"brikenstock\", "   
[4] "\"birkenstok\", \"bierkenstock\" and other variations in google."       
[5] "6 hours ago"                                                            
[6] "— business standard'"

It's not a regular "-". It's a "—". ~~And notice that by default readLines will omit the blank lines.~~
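The same idea can be carried one step further to get one string per article plus its source, which may scale better to 100,000 texts than separate text1, text2, ... variables. This is only a sketch under a few assumptions: the sample lines are inlined so it runs without a file, blank lines are dropped explicitly (readLines keeps them as "" elements), every publisher line starts with the em dash (written as "\u2014"), and the names is_source, group and articles are illustrative, not part of the answer above.

```r
# Inlined sample; in practice: Lines <- readLines("Lines.txt", encoding = "UTF-8")
Lines <- c(
  "venezuela probes ex-oil czar ramirez over alleged graft scheme",
  "state prosecutor's office announced on friday.",
  "",
  "5.5 hours ago",
  "\u2014 reuters",
  "",
  "amazon ordered not to pull in customers who can't spell `birkenstock'",
  "6 hours ago",
  "\u2014 business standard"
)

# Drop empty or whitespace-only lines
Lines <- Lines[nzchar(trimws(Lines))]

# TRUE for each publisher line (assumed to start with the em dash)
is_source <- grepl("^\u2014", Lines)

# Shifted cumulative sum: the group id increases on the line AFTER a publisher line
group <- cumsum(c(0, head(is_source, -1)))

# One row per article: full text (publisher line included) and the bare source
articles <- data.frame(
  text   = unname(vapply(split(Lines, group), paste, character(1), collapse = "\n")),
  source = sub("^\u2014[[:space:]]*", "", Lines[is_source]),
  stringsAsFactors = FALSE
)
```

articles$text[1] then plays the role of text1 and articles$text[2] the role of text2, with the matching publisher in articles$source.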
