hersh476

Reputation: 41

Extract alphanumeric words and words with more than 1 uppercase using R

I am new to R and want to extract alphanumeric words (words containing both letters and digits) as well as words containing more than one uppercase letter.

Below is an example of the string and my desired output for it.

    x <- c("123AB123 Electrical CDe FG123-4 ...", 
           "12/1/17 ABCD How are you today A123B", 
           "20.9.12 Eat / Drink XY1234 for PQRS1",
           "Going home H123a1 ab-cd1",
           "Change channel for al1234 to al5678")

    #Desired Output
    #[1] "123AB123 CDe FG123-4"  "ABCD A123B"  "XY1234 PQRS1"  
    #[4] "H123a1 ab-cd1"  "al1234 al5678"

I have come across 2 separate solutions so far on Stack Overflow:

  1. Extracting all words that contain a number. This does not help here because the column I am applying the function to contains many date strings, e.g. "12/1/17 ABCD How are you today A123B".
  2. Identifying strings that have more than one uppercase letter. Pierre Lafortune has provided the following solution:

how-to-count-capslock-in-string-using-r

    library(stringr)
    str_count(x, "\\b[A-Z]{2,}\\b") 

His code counts, for each string, the words made up entirely of two or more uppercase letters, but I want to extract such words and also extract the alphanumeric words.

Forgive me if my question or research is not comprehensive enough. I will post my researched solution for extracting all words containing a number in 12 hours, when I have access to my workstation, which has R and the dataset.

Upvotes: 3

Views: 1172

Answers (2)

Wiktor Stribiżew

Reputation: 627536

A single regex solution will also work:

    > library(stringr)
    > res <- str_extract_all(x, "(?<!\\S)(?:(?=\\S*\\p{L})(?=\\S*\\d)\\S+|(?:\\S*\\p{Lu}){2}\\S*)")
    > unlist(res)
     [1] "123AB123" "CDe"      "FG123-4"  "ABCD"     "A123B"    "XY1234"  
     [7] "PQRS1"    "H123a1"   "ab-cd1"   "al1234"   "al5678"

This will also work with regmatches in base R using the PCRE regex engine:

    > res2 <- regmatches(x, gregexpr("(?<!\\S)(?:(?=\\S*\\p{L})(?=\\S*\\d)\\S+|(?:\\S*\\p{Lu}){2}\\S*)", x, perl=TRUE))
    > unlist(res2)
     [1] "123AB123" "CDe"      "FG123-4"  "ABCD"     "A123B"    "XY1234"  
     [7] "PQRS1"    "H123a1"   "ab-cd1"   "al1234"   "al5678"

Why does it work?

  • (?<!\\S) - matches a position at the start of the string or right after a whitespace character
  • (?: - start of a non-capturing group that has two alternative patterns defined:
    • (?=\\S*\\p{L})(?=\\S*\\d)\\S+
      • (?=\\S*\\p{L}) - make sure there is a letter after 0+ non-whitespace chars (for better performance, replace \\S* with [^\\s\\p{L}]*)
      • (?=\\S*\\d) - make sure there is a digit after 0+ non-whitespace chars (for better performance, replace \\S* with [^\\s\\d]*)
      • \\S+ - match 1 or more non-whitespace chars
    • | - or
    • (?:\\S*\\p{Lu}){2}\\S*:
      • (?:\\S*\\p{Lu}){2} - 2 occurrences of 0+ non-whitespace chars (\\S*; for better performance, replace with [^\\s\\p{Lu}]*) followed by 1 uppercase letter (\\p{Lu})
      • \\S* - 0+ non-whitespace chars
  • ) - end of the non-capturing group.
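
The performance substitutions mentioned in the bullets above can be applied together. The following is only a sketch in base R, assuming the same `x` as in the question; `pat_fast` and `res_fast` are names made up for this illustration, and no timings were measured.

```r
x <- c("123AB123 Electrical CDe FG123-4 ...",
       "12/1/17 ABCD How are you today A123B",
       "20.9.12 Eat / Drink XY1234 for PQRS1",
       "Going home H123a1 ab-cd1",
       "Change channel for al1234 to al5678")

# Same pattern as above, but each \\S* in the lookaheads and in the repeated
# group is replaced by a negated class that stops at the character sought
pat_fast <- paste0(
  "(?<!\\S)(?:",
  "(?=[^\\s\\p{L}]*\\p{L})",           # a letter somewhere in the word
  "(?=[^\\s\\d]*\\d)",                 # a digit somewhere in the word
  "\\S+",                              # then consume the whole word
  "|",
  "(?:[^\\s\\p{Lu}]*\\p{Lu}){2}\\S*",  # or a word with 2+ uppercase letters
  ")"
)

res_fast <- regmatches(x, gregexpr(pat_fast, x, perl = TRUE))
unlist(res_fast)
```

The matches should be the same as with the original pattern; the negated classes only reduce backtracking inside the lookaheads and the repeated group.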

To join the matches found in each input string into a single string, you may use

    unlist(lapply(res, function(c) paste(unlist(c), collapse=" ")))

See an online R demo.

Output:

    [1] "123AB123 CDe FG123-4" "ABCD A123B"           "XY1234 PQRS1"        
    [4] "H123a1 ab-cd1"        "al1234 al5678"
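
An equivalent way to do the joining is `vapply`, which also guarantees one string per input element. This is a sketch in base R; `x_demo` adds one hypothetical string with no matching words (not part of the question's data) to show that such a string yields an empty string, and `joined` is likewise a name chosen for illustration.

```r
x <- c("123AB123 Electrical CDe FG123-4 ...",
       "12/1/17 ABCD How are you today A123B",
       "20.9.12 Eat / Drink XY1234 for PQRS1",
       "Going home H123a1 ab-cd1",
       "Change channel for al1234 to al5678")

# Hypothetical extra element with no alphanumeric or multi-uppercase words
x_demo <- c(x, "nothing here")

pat <- "(?<!\\S)(?:(?=\\S*\\p{L})(?=\\S*\\d)\\S+|(?:\\S*\\p{Lu}){2}\\S*)"
res <- regmatches(x_demo, gregexpr(pat, x_demo, perl = TRUE))

# Collapse the matches of each element; character(0) collapses to ""
joined <- vapply(res, paste, character(1), collapse = " ")
joined
```

The result has the same length as the input, so it can be assigned directly to a new data frame column.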

Upvotes: 2

RyanStochastic

Reputation: 4083

This works:

    library(stringr)

    # split the strings into a single vector with one word per element
    y <- unlist(str_split(x, ' '))

    # flag words with at least 2 uppercase letters
    uppers <- str_count(y, '[A-Z]') > 1

    # flag words with at least 1 letter
    alphas <- str_detect(y, '[:alpha:]')

    # flag words with at least 1 digit
    nums <- str_detect(y, '[:digit:]')

    # keep words that have 2 uppercase letters OR both a letter AND a digit
    y[uppers | (alphas & nums)]

     [1] "123AB123" "CDe"      "FG123-4"  "ABCD"     "A123B"    "XY1234"  
     [7] "PQRS1"    "H123a1"   "ab-cd1"   "al1234"   "al5678"
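
The result above is one flat vector, so the link between a word and its source string is lost. If one string per input element is wanted, as in the question's desired output, the same three tests can be applied string by string. This is a base R sketch of that idea, not the code from this answer; `keep_words` is a helper name made up for illustration.

```r
x <- c("123AB123 Electrical CDe FG123-4 ...",
       "12/1/17 ABCD How are you today A123B",
       "20.9.12 Eat / Drink XY1234 for PQRS1",
       "Going home H123a1 ab-cd1",
       "Change channel for al1234 to al5678")

# Hypothetical helper: keep the qualifying words of a single string
keep_words <- function(s) {
  words <- strsplit(s, "\\s+")[[1]]
  # number of uppercase ASCII letters in each word
  n_upper <- lengths(regmatches(words, gregexpr("[A-Z]", words)))
  has_alpha <- grepl("[[:alpha:]]", words)
  has_digit <- grepl("[[:digit:]]", words)
  # 2+ uppercase letters, OR both a letter and a digit
  paste(words[n_upper >= 2 | (has_alpha & has_digit)], collapse = " ")
}

out <- vapply(x, keep_words, character(1), USE.NAMES = FALSE)
out
```

Like the code above, this only counts ASCII uppercase letters (`[A-Z]`), whereas the regex answer uses `\\p{Lu}` and so covers any Unicode uppercase letter.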

Upvotes: 2
