coding_heart

Reputation: 1295

How to Count Text Lines in R?

I would like to calculate the number of lines spoken by different speakers from a text using R (it is a transcript of parliamentary speaking records). The basic text looks like:

MR. JOHN: This activity has been going on in Tororo and I took it up with the office of the DPC. He told me that he was not aware of it.
MS. SMITH: Yes, I am aware of that. 
MR. LEHMAN: Therefore, I am seeking your guidance, Madam Speaker, and requesting that you re-assign the duty.  
MR. JOHN: Thank you

In the documents, each speaker has an identifier that begins with MR/MS and is always capitalized. I would like to create a dataset that counts the number of lines spoken by each speaker each time they speak in a document, such that the above text would result in:

MR. JOHN: 2
MS. SMITH: 1
MR. LEHMAN: 2
MR. JOHN: 1

Thanks for pointers using R!

Upvotes: 3

Views: 2622

Answers (1)

Arun

Reputation: 118799

You can split each string at the : and then use table on the first piece (here x is a character vector with one element per line of the transcript):

table(sapply(strsplit(x, ":"), "[[", 1))
#   MR. JOHN MR. LEHMAN  MS. SMITH 
#          2          1          1 

strsplit - splits strings at : and results in a list
sapply with [[ - selects the first element of each list component (the speaker label)
table - gets the frequency
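
For reference, here is a minimal self-contained sketch of the same three steps, assuming the four sample lines from the question have been typed into a character vector x by hand (the intermediate names parts, speakers and counts are only illustrative):

```r
# The four sample lines from the question, one string per speaker line
x <- c(
  "MR. JOHN: This activity has been going on in Tororo and I took it up with the office of the DPC. He told me that he was not aware of it.",
  "MS. SMITH: Yes, I am aware of that.",
  "MR. LEHMAN: Therefore, I am seeking your guidance, Madam Speaker, and requesting that you re-assign the duty.",
  "MR. JOHN: Thank you"
)

parts    <- strsplit(x, ":")          # list of character vectors, split at every ":"
speakers <- sapply(parts, "[[", 1)    # first piece of each split, i.e. the speaker label
counts   <- table(speakers)           # how often each label occurs
print(counts)
```

Note that table reports totals per speaker across the whole vector, sorted by label, not one count per turn.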

Edit: Following OP's comment. You can save the transcripts in a text file and use readLines to read the text in R.

tt <- readLines("./tmp.txt")

Now, we'll have to find a pattern by which to filter this text for just those lines with the names of those who're speaking. I can think of two approaches based on what I saw in the transcript you linked.

  • Check for a : and use a lookbehind to test whether the character immediately before it is a capital letter (A-Z) or a punctuation mark ([:punct:]). Punctuation is included because some speaker labels end with a ) before the :.

Filter the lines with grepl first, then apply strsplit followed by sapply as before:

# filter tt by pattern
tt.f <- tt[grepl("(?<=[A-Z[:punct:]]):", tt, perl = TRUE)]
# Now you should only have the required lines, use the command above:

out <- table(sapply(strsplit(tt.f, ":"), "[[", 1))

Other approaches are possible (using gsub, for example), as are alternative patterns, but this should give you an idea of the approach. If your transcripts follow a different pattern, change the regular expression so that it captures all the required lines.
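
To see the filter-then-count steps without a file on disk, here is a small sketch that uses a made-up character vector in place of the result of readLines. The transcript lines, including the "(Minister)" label, are hypothetical and only meant to exercise both branches of the pattern:

```r
# Hypothetical transcript lines; in practice these would come from readLines()
tt <- c(
  "MR. JOHN: This activity has been going on in Tororo.",
  "He told me that he was not aware of it.",           # continuation line, no speaker label
  "MS. SMITH (Minister): Yes, I am aware of that.",    # label ends with ")" before the ":"
  "MR. JOHN: Thank you"
)

# TRUE where a ":" is immediately preceded by a capital letter or a punctuation mark
is_speaker <- grepl("(?<=[A-Z[:punct:]]):", tt, perl = TRUE)
tt.f <- tt[is_speaker]

# Same counting step as before, applied to the filtered lines only
out <- table(sapply(strsplit(tt.f, ":"), "[[", 1))
print(out)
```

The continuation line is dropped because it contains no :, while the ) in the second label is why [:punct:] is part of the lookbehind.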

Of course, this assumes that there is no other line, for example, like this:

"Mr. Chairman, whatever (bla bla): It is not a problem"

The pattern returns TRUE for the ): in that line, so it would be wrongly counted as a speaker line. If lines like this occur in the text, you'll need a stricter pattern.
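
One possible stricter pattern, sketched under the assumption (taken from the question) that every speaker label starts at the beginning of a line with MR. or MS. and is fully capitalized, optionally followed by a parenthesised note. The regular expression and the sample lines below are illustrative and would need adjusting to the real transcripts:

```r
# Hypothetical lines, including the problematic one from above
tt <- c(
  "MR. JOHN: Thank you",
  "MS. SMITH (Minister): Yes, I am aware of that.",
  "Mr. Chairman, whatever (bla bla): It is not a problem",
  "MR. LEHMAN: Therefore, I am seeking your guidance."
)

# Anchor at the line start, require MR./MS. plus a capitalized name,
# allow an optional " (...)" note, and only then the ":"
strict <- "^(MR|MS)\\. [A-Z][A-Z .'-]*( \\([^)]*\\))?:"

is_speaker <- grepl(strict, tt, perl = TRUE)
tt.f <- tt[is_speaker]

out <- table(sapply(strsplit(tt.f, ":"), "[[", 1))
print(out)
```

Because the match is case-sensitive and anchored with ^, the "Mr. Chairman" line no longer passes the filter.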

Upvotes: 10
