J. Tang

Reputation: 11

How to find out all the capital words in a corpus in R

I have a document corpus and need to find all the words that are entirely uppercase (i.e., every character in the word is a capital letter) across all the documents, using R. I am not sure how to do that. I have looked at the text mining package 'tm', and it does not seem to have a function for this.

Input String: "Russia Is THE BiggEST cOUNTRY"

Output required: "THE"

How to do this using "tm" package?

Upvotes: 1

Views: 1612

Answers (3)

Arun kumar mahesh

Reputation: 2359

You can use gregexpr and regmatches:

unlist(regmatches(abc, gregexpr('\\b[A-Z]+\\b', abc)))
[1] "THE"

data

abc <- "Russia Is THE BiggEST cOUNTRY"
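Since the question is about a whole corpus, here is a minimal sketch of the same gregexpr/regmatches approach applied to several documents at once. It assumes the corpus text has already been extracted into a character vector; the `docs` vector and its contents are made up for illustration.

```r
# Hypothetical example documents; in practice this would be the corpus text
# already extracted into a character vector, one element per document.
docs <- c("Russia Is THE BiggEST cOUNTRY",
          "IN the WORLD",
          "no capitals here")

# gregexpr() locates every match in each element; regmatches() extracts them.
# The result is a list with one character vector per document.
caps <- regmatches(docs, gregexpr("\\b[A-Z]+\\b", docs))

# Flatten when a single vector across all documents is enough.
all_caps <- unlist(caps)
print(all_caps)
```

Documents without any all-caps word contribute an empty vector to `caps`. Note that `[A-Z]` only covers ASCII letters, and that single-letter words such as "A" also match this pattern.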

Upvotes: 2

Sandipan Dey

Reputation: 23099

With stringr, if you want all such all-caps words as a vector rather than just the first one:

s = "Russia Is THE BiggEST cOUNTRY IN the WORLD"
library(stringr)
unlist(str_match_all(s, "\\b[A-Z]+\\b"))
[1] "THE"   "IN"    "WORLD"
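As a possible variation, `str_extract_all` returns the matched strings directly, whereas `str_match_all` returns matrices meant for capture groups. The sketch below also shows a pattern that requires at least two letters, on the assumption that single-letter words should be excluded; the second input string is made up for illustration.

```r
library(stringr)

s <- "Russia Is THE BiggEST cOUNTRY IN the WORLD"

# str_extract_all() returns a list of character vectors, one per input string.
words <- str_extract_all(s, "\\b[A-Z]+\\b")[[1]]

# Requiring at least two letters skips single-letter words such as "I" or "A".
long_words <- str_extract_all("I saw A BIG dog", "\\b[A-Z]{2,}\\b")[[1]]

print(words)
print(long_words)
```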

Upvotes: 2

Shenglin Chen

Reputation: 4554

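Try a regular expression with sub. Note that sub returns a single string per input, so this yields only one all-caps word, and it returns the input unchanged if nothing matches.

string <- "Russia Is THE BiggEST cOUNTRY"
sub('.*(\\b[A-Z]+\\b).*','\\1',string)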
#[1] "THE"

Upvotes: 1
