jackStinger

Reputation: 2055

Extracting Words of specific length in R using regular expressions

I have code like this (I got it here):

m<- c("Hello! #London is gr8. I really likewhatishappening here! The alcomb of Mount Everest is excellent! the aforementioned place is amazing! #Wow")

x<- gsub("\\<[a-z]\\{4,10\\}\\>","",m)
x

I tried other ways of doing it, like

m<- c("Hello! #London is gr8. I really likewhatishappening here! The alcomb of Mount Everest is excellent! the aforementioned place is amazing! #Wow")

x<- gsub("[^(\\b.{4,10}\\b)]","",m)
x

I need to remove words that are shorter than 4 or longer than 10 characters. Where am I going wrong?

Upvotes: 10

Views: 7051

Answers (6)

agstudy

Reputation: 121568

  gsub("\\b[a-zA-Z0-9]{4,10}\\b", "", m) 
 "! # is gr8. I  likewhatishappening ! The  of   is ! the aforementioned  is ! #Wow"

Let's explain the regular expression terms:

  1. \b matches at a position called a "word boundary". This match is zero-length.
  2. [a-zA-Z0-9]: any single alphanumeric character.
  3. {4,10}: {min,max}, i.e. the preceding item repeated between 4 and 10 times.
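
As a quick illustration of these terms, here is a minimal sketch. The `words` vector is made up for the example, and `perl = TRUE` is my own choice to pin down the regex engine; `grepl` then reports which words fit the 4-to-10 bound:

```r
# Made-up sample words of length 3, 4, 9 and 14
words <- c("gr8", "here", "excellent", "aforementioned")

# TRUE only when a run of 4 to 10 alphanumerics sits between two word boundaries
fits <- grepl("\\b[a-zA-Z0-9]{4,10}\\b", words, perl = TRUE)
fits
# [1] FALSE  TRUE  TRUE FALSE
```

The 14-letter word fails because the only word boundaries are at its two ends, so no 4-to-10 character slice of it is bounded on both sides.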

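If you want to refer to the matched words in the replacement, wrap the pattern in parentheses and use the backreference \\1 (not //1). Note that replacing each match with itself returns the string unchanged, so this alone does not give the negation:

gsub("(\\b[a-zA-Z0-9]{4,10}\\b)", "\\1", m) 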

"Hello! #London is gr8. I really likewhatishappening here! The alcomb of Mount Everest is excellent! the aforementioned place is amazing! #Wow"

It is funny to see that words with 4 letters exist in the 2 regexpr.
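
If the goal is to keep the 4-to-10 character words rather than delete them, one possible approach (a sketch, not necessarily what was intended above) is to extract the matches with `gregexpr` and `regmatches` from base R and paste them back together. The variable names `kept` and `result` are mine, and `perl = TRUE` is an assumption to fix the regex engine:

```r
m <- "Hello! #London is gr8. I really likewhatishappening here! The alcomb of Mount Everest is excellent! the aforementioned place is amazing! #Wow"

# gregexpr() locates every match; regmatches() pulls the matched text out
kept <- regmatches(m, gregexpr("\\b[a-zA-Z0-9]{4,10}\\b", m, perl = TRUE))[[1]]

# Join the surviving words with single spaces
result <- paste(kept, collapse = " ")
result
# [1] "Hello London really here alcomb Mount Everest excellent place amazing"
```

Punctuation and hashtags are dropped as a side effect, since only the alphanumeric runs are extracted.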

Upvotes: 12

jackStinger

Reputation: 2055

Derived from the answers by Alexander & agstudy:

x<- gsub("\\b[a-zA-Z0-9]{1,3}\\b|\\b[a-zA-Z0-9]{11,}\\b", "", m)

Working now!
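
A quick way to sanity-check the length bounds is to run the pattern on made-up words of known length. This sketch assumes the long-word alternative should start at 11, so that a word of exactly 10 characters is kept, as the question asks. `sample_text` and `cleaned` are hypothetical names, and `perl = TRUE` is my own choice of engine:

```r
# Made-up words of length 3, 4, 10 and 11
sample_text <- "abc abcd abcdefghij abcdefghijk"

# Remove words of 1-3 characters or of 11+ characters
pattern <- "\\b[a-zA-Z0-9]{1,3}\\b|\\b[a-zA-Z0-9]{11,}\\b"
cleaned <- trimws(gsub(pattern, "", sample_text, perl = TRUE))
cleaned
# [1] "abcd abcdefghij"
```

With `{10,}` in place of `{11,}`, the 10-character word would be removed as well.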

Thanks a ton, guyz!

Upvotes: 1

Alexander Taver

Reputation: 474

I'm not familiar with R and don't know which character classes or other features it supports in regular expression patterns. Without them, the pattern would look like this:

[^A-Za-z0-9]([A-Za-z0-9]{1,3}|[A-Za-z0-9]{11,})[^A-Za-z0-9]

Upvotes: 0

Matt

Reputation: 17629

This might get you started:

m <- c("Hello! #London is gr8. I really likewhatishappening here! The alcomb of Mount Everest is excellent! the aforementioned place is amazing! #Wow")
y <- gsub("\\b[a-zA-Z0-9]{1,3}\\b", "", m) # replace words shorter than 4
y <- gsub("\\b[a-zA-Z0-9]{11,}\\b", "", y) # replace words longer than 10
y <- gsub("\\s+\\.\\s+ ", ". ", y) # replace stray dots, eg "Foo  .  Bar" -> "Foo. Bar"
y <- gsub("\\s+", " ", y) # replace multiple spaces with one space
y <- gsub("#", "", y) # remove leftover hash characters from hashtags
y <- gsub("^\\s+|\\s+$", "", y) # remove leading and trailing whitespaces
y
# [1] "Hello! London. really here! alcomb Mount Everest excellent! place amazing!"

Upvotes: 1

Wojciech Sobala

Reputation: 7551

gsub(" [^ ]{1,3} | [^ ]{11,} "," ",m)
[1] "Hello! #London gr8. really here! alcomb Mount Everest excellent! aforementioned
     place amazing! #Wow"
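
Note that "aforementioned" survives in the output above: the space in front of it was already consumed by the preceding match " the ", so the second alternative has no leading space left to match. One way around this, sketched here under the assumption that PCRE lookarounds (`perl = TRUE`) are acceptable, is to test for the surrounding spaces without consuming them. The sample string `x` and the names `stripped` and `tidy` are made up for the example:

```r
x <- "the aforementioned place is amazing"

# (?<![^ ]) : not preceded by a non-space (so: start of string or a space)
# (?![^ ])  : not followed by a non-space (so: end of string or a space)
stripped <- gsub("(?<![^ ])(?:[^ ]{1,3}|[^ ]{11,})(?![^ ])", "", x, perl = TRUE)

# Collapse the leftover runs of spaces and trim the ends
tidy <- trimws(gsub(" +", " ", stripped))
tidy
# [1] "place amazing"
```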

Upvotes: 1

Anthony Damico

Reputation: 6104

# starting string
m <- c("Hello! #London is gr8. I really likewhatishappening here! The alcomb of Mount Everest is excellent! the aforementioned place is amazing! #Wow")

# remove punctuation (optional)
v <- gsub("[[:punct:]]", " ", m)

# split into distinct words
w <- strsplit( v , " " )

# calculate the length of each word
x <- nchar( w[[1]] )

# keep only words with length 4, 5, 6, 7, 8, 9, or 10
y <- w[[1]][ x %in% 4:10 ]

# string 'em back together
z <- paste( unlist( y ), collapse = " " )

# voila
z
# [1] "Hello London really here alcomb Mount Everest excellent place amazing"
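
The steps above work on a single string (`w[[1]]`). If several strings need the same treatment, they could be wrapped in a small helper along these lines. This is only a sketch: `keep_words_by_length`, `min_len` and `max_len` are hypothetical names, and splitting on runs of spaces (`" +"`) is my own variation:

```r
# Hypothetical helper: keep only words whose length lies in [min_len, max_len]
keep_words_by_length <- function(text, min_len = 4, max_len = 10) {
  cleaned <- gsub("[[:punct:]]", " ", text)   # punctuation -> spaces
  words <- strsplit(cleaned, " +")            # one character vector per input string
  vapply(words, function(w) {
    len <- nchar(w)
    paste(w[len >= min_len & len <= max_len], collapse = " ")
  }, character(1))
}

out <- keep_words_by_length(c("Hello! #London is gr8.",
                              "the aforementioned place is amazing!"))
out
# [1] "Hello London"  "place amazing"
```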

Upvotes: 1
