Reputation: 359
I have many large text files with the following basic composition:
text<-"this is a speech text. FIRST PERSON: hi all, thank you for coming. SECOND PERSON: thank you for inviting us"
As you can see, it is composed of: 1) Random text, 2) Person in uppercase, 3) Speech.
I've managed to split the text into a vector of words using:
textw<-unlist(strsplit(text," "))
I then find the positions of all the words that are entirely uppercase:
grep(pattern = "^[[:upper:]]*$",x = textw)
And I have extracted the names of the persons into a vector:
upperv<-textw[grep(pattern = "^[[:upper:]]*$",x = textw)]
The desired outcome would be a data frame or table like this:
Result<-data.frame(person=c(" ","FIRST PERSON","SECOND PERSON"),
message=c("this is a speech text.","hi all, thank you for coming.","thank you for inviting us"))
Result
person message
1 this is a speech text.
2 FIRST PERSON hi all, thank you for coming.
3 SECOND PERSON thank you for inviting us
I'm having trouble "linking" each message to its author.
Also note: there are uppercase words which are NOT an author, for example "I". How could I specify a separation only where 2 or more uppercase words are next to each other?
In other words, if the words at positions 2 and 3 are uppercase, then take as the message everything from position 4 until the next occurrence of two consecutive uppercase words.
Any help appreciated.
Upvotes: 3
Views: 238
Reputation: 12860
Basic Approach
1) To get the messages, I follow Tyler Rinker's approach and split the text on a sequence of one or more (+) upper-case letters ([[:upper:]]) that may also contain spaces and colons ([ [:upper:]:]): "[[:upper:]]+[ [:upper:]:]+"
2) To extract the persons speaking, nearly the same regex is used (no longer allowing colons): "[[:upper:]]+[ [:upper:]]+"
(again, the basic idea is stolen from Tyler Rinker)
stringr
require(stringr)
text <- "this is a speech text. FIRST PERSON: hi all, thank you for coming. SECOND PERSON: thank you for inviting us"
data.frame (
person = c( NA,
unlist(str_extract_all(text, "[[:upper:]]+[ [:upper:]]+"))
),
message = unlist(str_split(text, "[[:upper:]]+[ [:upper:]:]+"))
)
## person message
## 1 <NA> this is a speech text.
## 2 FIRST PERSON hi all, thank you for coming.
## 3 SECOND PERSON thank you for inviting us
stringi
require(stringi)
text <- "this is a speech text. FIRST PERSON: hi all, thank you for coming. SECOND PERSON: thank you for inviting us"
data.frame (
person = c( NA,
unlist(stri_extract_all(text, regex="[[:upper:]]+[ [:upper:]]+"))
),
message = unlist(stri_split(text, regex="[[:upper:]]+[ [:upper:]:]+"))
)
## person message
## 1 <NA> this is a speech text.
## 2 FIRST PERSON hi all, thank you for coming.
## 3 SECOND PERSON thank you for inviting us
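Note that both patterns above also match a single capital letter followed by a space, such as the "I" mentioned in the question. Below is a minimal base R sketch of one possible guard. It assumes that every speaker label consists of two or more all-uppercase words followed by a colon; the `label` pattern and the extra "I" in the sample text are illustrative assumptions, not taken from the original posts.

```r
# Sample text, extended with a lone "I" to show it is not treated as a speaker.
text <- "this is a speech text. FIRST PERSON: hi all, I thank you for coming. SECOND PERSON: thank you for inviting us"

# Assumption: a speaker label is two or more all-uppercase words plus a colon.
label <- "[[:upper:]]+( [[:upper:]]+)+:"

# Locate every label in the text.
m <- gregexpr(label, text)

# Matched labels, with the trailing colon removed.
persons <- sub(":$", "", regmatches(text, m)[[1]])

# Everything between the labels, with surrounding whitespace trimmed.
messages <- trimws(regmatches(text, m, invert = TRUE)[[1]])

# The text before the first label has no speaker, hence the leading NA.
result <- data.frame(person = c(NA, persons), message = messages,
                     stringsAsFactors = FALSE)
result
```

With `invert = TRUE`, `regmatches()` returns one more piece than there are matches, so the leading `NA` keeps both columns the same length.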
Hints (that reflect my preferences rather than rules)
1) I would prefer "[A-Z]+" over "[A-Z]{1,1000}" because in the first case one does not have to decide what might actually be a reasonable number to put in.
2) I would prefer "[[:upper:]]" over "[A-Z]" because the former works like this ...
str_extract("Á", "[[:upper:]]")
## [1] "Á"
... while the latter works like this ...
str_extract("Á", "[A-Z]")
## [1] NA
... in the case of special characters.
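Hint 1 can be illustrated with a small base R sketch. It assumes `perl = TRUE` (PCRE), and the 1500-letter input is made up for the demonstration: the bounded quantifier stops at its upper limit, while `+` takes the whole run.

```r
# A run of 1500 capital letters, longer than the {1,1000} bound (made-up input).
x <- strrep("A", 1500)

# Length of the first match for each quantifier.
len_plus    <- attr(regexpr("[A-Z]+", x, perl = TRUE), "match.length")
len_bounded <- attr(regexpr("[A-Z]{1,1000}", x, perl = TRUE), "match.length")

len_plus     # the whole run is matched
len_bounded  # the match is cut off at the upper bound
```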
Upvotes: 2
Reputation: 109874
Here's one approach using the stringi package:
text <- "this is a speech text. FIRST PERSON: hi all, thank you for coming. SECOND PERSON: thank you for inviting us"
library(stringi)
txt <- unlist(stri_split_regex(text, "(?<![A-Z]{2,1000})\\s+(?=[A-Z]{2,1000})"))
data.frame(
person = stri_extract_first_regex(txt, "[A-Z ]+(?=(:\\s))"),
message = stri_replace_first_regex(txt, "[A-Z ]+:\\s+", "")
)
## person message
## 1 <NA> this is a speech text.
## 2 FIRST PERSON hi all, thank you for coming.
## 3 SECOND PERSON thank you for inviting us
Upvotes: 8