Parse text by uppercase in R

Question

I have many large text files with the following basic composition:

text<-"this is a speech text. FIRST PERSON: hi all, thank you for coming. SECOND PERSON: thank you for inviting us"

As you can see, it is composed of: 1) Random text, 2) Person in uppercase, 3) Speech.

I've managed to split the text into a vector of words using:

textw<-unlist(strsplit(text," "))

I then find the positions of all the words that are uppercase:

grep(pattern = "^[[:upper:]]*$",x = textw)

And I have separated the names of the persons into a vector:

upperv<-textw[grep(pattern = "^[[:upper:]]*$",x = textw)]

The desired outcome would be a data frame or table like this:

Result<-data.frame(person=c(" ","FIRST PERSON","SECOND PERSON"),
         message=c("this is a speech text.","hi all, thank you for coming.","thank you for inviting us"))

Result
         person                       message
1                      this is a speech text.
2  FIRST PERSON hi all, thank you for coming.
3 SECOND PERSON     thank you for inviting us

I'm having trouble "linking" each message to its author.

Also note: there are uppercase words which are NOT an author, for example "I". How could I specify a separation only where two or more uppercase words are next to each other?

In other words, if the words at positions 2 and 3 are uppercase, then take as the message everything from position 4 until the next occurrence of two consecutive uppercase words.

Any help appreciated.

Tyler Rinker · Accepted Answer

Here's one approach using the stringi package:

text <- "this is a speech text. FIRST PERSON: hi all, thank you for coming. SECOND PERSON: thank you for inviting us"

library(stringi)
txt <- unlist(stri_split_regex(text, "(?…

## 1          <NA>        this is a speech text.
## 2  FIRST PERSON hi all, thank you for coming.
## 3 SECOND PERSON     thank you for inviting us
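
The answer's code is cut off in this copy, so what follows is not Tyler Rinker's original. It is a minimal base R sketch of the same split-then-extract idea, swapping stringi for strsplit() and sub() with perl = TRUE. The helper name split_speakers and the label pattern are hypothetical, and the sketch assumes a speaker label is two or more all-caps words separated by single spaces and followed by a colon, so a lone uppercase word such as "I" does not start a new row.

```r
# Hypothetical helper (not from the original answer): split a transcript into
# one row per speaker turn.
# Assumption: a speaker label is two or more all-caps words separated by single
# spaces and followed by a colon, e.g. "FIRST PERSON:".
split_speakers <- function(text) {
  label <- "[A-Z]+(?: [A-Z]+)+"

  # Split on the whitespace that precedes a label; the lookahead keeps the
  # label itself attached to the piece that follows.
  split_re <- paste0("\\s+(?=", label, ":)")
  pieces <- strsplit(text, split_re, perl = TRUE)[[1]]

  # Pieces that start with a label get it as the person; the rest stay NA.
  label_re <- paste0("^", label, "(?=:)")
  has_label <- grepl(label_re, pieces, perl = TRUE)
  person <- rep(NA_character_, length(pieces))
  person[has_label] <- regmatches(pieces, regexpr(label_re, pieces, perl = TRUE))

  # Drop the leading "LABEL: " from each piece so only the message remains.
  msg <- sub(paste0("^", label, ":\\s*"), "", pieces, perl = TRUE)

  data.frame(person = person, message = msg, stringsAsFactors = FALSE)
}

text <- "this is a speech text. FIRST PERSON: hi all, thank you for coming. SECOND PERSON: thank you for inviting us"
result <- split_speakers(text)
print(result)
```

On the sample text this should give three rows, with NA as the person for the introductory text. If the real files use a different label format (no colon, digits or punctuation in names, labels spanning line breaks), the label pattern would need adjusting.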
