eflores89

Reputation: 359

Parse text by uppercase in R

I have many large text files with the following basic composition:

text<-"this is a speech text. FIRST PERSON: hi all, thank you for coming. SECOND PERSON: thank you for inviting us"

As you can see, it is composed of: 1) Random text, 2) Person in uppercase, 3) Speech.

I've managed to split the text into a vector of words using:

textw<-unlist(strsplit(text," "))

I then find the positions of all the words which are uppercase:

grep(pattern = "^[[:upper:]]*$",x = textw)

And I have extracted those uppercase words into a vector:

upperv<-textw[grep(pattern = "^[[:upper:]]*$",x = textw)]

The desired outcome would be a data frame or table like this:

Result<-data.frame(person=c(" ","FIRST PERSON","SECOND PERSON"),
         message=c("this is a speech text.","hi all, thank you for coming.","thank you for inviting us"))

Result
         person                       message
1                      this is a speech text.
2  FIRST PERSON hi all, thank you for coming.
3 SECOND PERSON     thank you for inviting us

I'm having trouble "linking" each message to its author.

Also note: there are uppercase words which are NOT an author, for example "I". How could I specify a separation only where 2 or more uppercase words are next to each other?

In other words, if position 2 and 3 are upper case, then place as message everything from position 4 until next occurrence of double uppercases.

Any help appreciated.

Upvotes: 3

Views: 238

Answers (2)

petermeissner

Reputation: 12860

Basic Approach

1) To get the text, I follow Tyler Rinker's approach of splitting the text at a sequence of one or more (+) upper-case letters ([[:upper:]]) that may be followed by further upper-case letters, spaces and colons ([ [:upper:]:]): "[[:upper:]]+[ [:upper:]:]+"

2) To extract the persons speaking, nearly the same regex is used (no longer allowing colons): "[[:upper:]]+[ [:upper:]]+" (again, the basic idea is borrowed from Tyler Rinker)

stringr

require(stringr)

text <- "this is a speech text. FIRST PERSON: hi all, thank you for coming. SECOND PERSON: thank you for inviting us"

data.frame (
    person  = c( NA,
                 unlist(str_extract_all(text, "[[:upper:]]+[ [:upper:]]+"))
                ),
    message = unlist(str_split(text, "[[:upper:]]+[ [:upper:]:]+"))
    )

##          person                        message
## 1          <NA>        this is a speech text. 
## 2  FIRST PERSON hi all, thank you for coming. 
## 3 SECOND PERSON      thank you for inviting us

stringi

require(stringi)

text <- "this is a speech text. FIRST PERSON: hi all, thank you for coming. SECOND PERSON: thank you for inviting us"

data.frame (
    person  = c( NA,
                 unlist(stri_extract_all(text, regex="[[:upper:]]+[ [:upper:]]+"))
                ),
    message = unlist(stri_split(text, regex="[[:upper:]]+[ [:upper:]:]+"))
    )

##          person                        message
## 1          <NA>        this is a speech text. 
## 2  FIRST PERSON hi all, thank you for coming. 
## 3 SECOND PERSON      thank you for inviting us

Hints (that reflect my preferences rather than rules)

1) I would prefer "[A-Z]+" over "[A-Z]{1,1000}" because in the first case one does not have to decide what a reasonable upper bound might be.

2) I would prefer "[[:upper:]]" over "[A-Z]" because the former works like this ...

str_extract("Á", "[[:upper:]]")
## [1] "Á"

... while the latter works like this ...

str_extract("Á", "[A-Z]")
## [1] NA

... for accented or other non-ASCII upper-case characters.
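One caveat worth illustrating: the question mentions single upper-case words such as "I" that are not speakers, and "[[:upper:]]+[ [:upper:]]+" would also match "I " followed by a space. The following is only a sketch in base R (no packages), assuming that a speaker label is always two or more upper-case words followed directly by a colon. The name `speaker_pattern` and the modified sample text are made up for illustration and are not part of the question.

```r
# Sample text with a stray "I" added to show that single capitals are ignored
text <- "I think this is a speech text. FIRST PERSON: hi all, I thank you for coming. SECOND PERSON: thank you for inviting us"

# Assumed label format: two or more all-upper-case words, then a colon
speaker_pattern <- "[[:upper:]]+( [[:upper:]]+)+:"

hits   <- gregexpr(speaker_pattern, text)
labels <- regmatches(text, hits)[[1]]                 # the speaker labels, colon included
pieces <- regmatches(text, hits, invert = TRUE)[[1]]  # text before, between and after the labels

result <- data.frame(
    person  = c(NA, sub(":$", "", labels)),  # text before the first label has no speaker
    message = trimws(pieces),                # drop leading/trailing spaces
    stringsAsFactors = FALSE
)
result
```

Because `regmatches(..., invert = TRUE)` always returns one more piece than there are matches, the leading `NA` keeps `person` and `message` the same length. If the label format in the real files differs (for example, names containing digits or punctuation), the pattern would need adjusting.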

Upvotes: 2

Tyler Rinker

Reputation: 109874

Here's one approach using the stringi package:

text <- "this is a speech text. FIRST PERSON: hi all, thank you for coming. SECOND PERSON: thank you for inviting us"

library(stringi)
txt <- unlist(stri_split_regex(text, "(?<![A-Z]{2,1000})\\s+(?=[A-Z]{2,1000})"))

data.frame(
    person = stri_extract_first_regex(txt, "[A-Z ]+(?=(:\\s))"),
    message = stri_replace_first_regex(txt, "[A-Z ]+:\\s+", "")
)


##          person                       message
## 1          <NA>        this is a speech text.
## 2  FIRST PERSON hi all, thank you for coming.
## 3 SECOND PERSON     thank you for inviting us
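Since the question mentions many text files, the same steps could be wrapped in a small function and applied to each text in turn. This is a rough sketch, not a tested solution for real transcripts: `parse_speech()` and the `texts` vector are hypothetical names, and the second sample string is just the original example cut in two (with an "I" added to show that a single capital does not trigger a split).

```r
library(stringi)

# Hypothetical helper: split one text into chunks, then pull out speaker and message
parse_speech <- function(text) {
    # Split at whitespace that is followed by 2+ capitals but not preceded by 2+ capitals
    chunks <- unlist(stri_split_regex(text, "(?<![A-Z]{2,1000})\\s+(?=[A-Z]{2,1000})"))
    data.frame(
        person  = stri_extract_first_regex(chunks, "[A-Z ]+(?=(:\\s))"),
        message = stri_replace_first_regex(chunks, "[A-Z ]+:\\s+", ""),
        stringsAsFactors = FALSE
    )
}

# Stand-in for the contents of several files
texts <- c(
    "this is a speech text. FIRST PERSON: hi all, I thank you for coming.",
    "SECOND PERSON: thank you for inviting us"
)

# One data frame per text, stacked into a single data frame
parsed <- do.call(rbind, lapply(texts, parse_speech))
parsed
```

With real files, each element of `texts` could be read in first (for example with `readLines()` and `paste(..., collapse = " ")`); that part is left out here because the file layout is not given in the question.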

Upvotes: 8
