Extract alphanumeric words and words with more than 1 uppercase using R

Question

I am new to R programming and want to extract alphanumeric words AND words containing more than one uppercase letter.

Below is an example of the string and my desired output for it.

    x <- c("123AB123 Electrical CDe FG123-4 ...", 
           "12/1/17 ABCD How are you today A123B", 
           "20.9.12 Eat / Drink XY1234 for PQRS1",
           "Going home H123a1 ab-cd1",
           "Change channel for al1234 to al5678")

    #Desired Output
    #[1] "123AB123 CDe FG123-4"  "ABCD A123B"  "XY1234 PQRS1"  
    #[4] "H123a1 ab-cd1"  "al1234 al5678"

I have come across 2 separate solutions so far on Stack Overflow:

Extract all words that contain a number: not helpful to me, because the column I'm applying the function to contains many date strings, e.g. "12/1/17 ABCD How are you today A123B".
Identify strings that have more than one uppercase letter: Pierre Lafortune has provided the following solution:

how-to-count-capslock-in-string-using-r

    library(stringr)
    str_count(x, "\\b[A-Z]{2,}\\b")

His code counts the words in each string that consist of two or more uppercase letters, but I want to extract those words and the alphanumeric words as well.

Forgive me if my question or research is not comprehensive enough. I will post the solution I found for extracting all words containing a number in 12 hours, when I have access to my workstation, which has R and the dataset.

Wiktor Stribiżew · Accepted Answer

A single regex solution will also work:

    > res <- str_extract_all(x, "(?<!\\S)(?:(?=\\S*\\p{L})(?=\\S*\\d)\\S+|(?:\\S*\\p{Lu}){2}\\S*)")
    > unlist(res)
     [1] "123AB123" "CDe"      "FG123-4"  "ABCD"     "A123B"    "XY1234"  
     [7] "PQRS1"    "H123a1"   "ab-cd1"   "al1234"   "al5678"

This will also work with regmatches in base R using the PCRE regex engine:

    > res2 <- regmatches(x, gregexpr("(?<!\\S)(?:(?=\\S*\\p{L})(?=\\S*\\d)\\S+|(?:\\S*\\p{Lu}){2}\\S*)", x, perl=TRUE))
    > unlist(res2)
     [1] "123AB123" "CDe"      "FG123-4"  "ABCD"     "A123B"    "XY1234"  
     [7] "PQRS1"    "H123a1"   "ab-cd1"   "al1234"   "al5678"

Why does it work?

(?<!\S) - matches a position at the start of the string or right after a whitespace character
(?: - start of a non-capturing group that holds two alternative patterns:

(?=\S*\p{L})(?=\S*\d)\S+ - a word that contains both a letter and a digit:
  (?=\S*\p{L}) - make sure there is a letter after 0+ non-whitespace chars (for better performance, replace \S* with [^\s\p{L}]*)
  (?=\S*\d) - make sure there is a digit after 0+ non-whitespace chars (for better performance, replace \S* with [^\s\d]*)
  \S+ - match 1 or more non-whitespace chars

| - or

(?:\S*\p{Lu}){2}\S* - a word that contains at least two uppercase letters:
  (?:\S*\p{Lu}){2} - 2 occurrences of 0+ non-whitespace chars (\S*; for better performance, replace with [^\s\p{Lu}]*) followed by 1 uppercase letter (\p{Lu})
  \S* - 0+ non-whitespace chars

) - end of the non-capturing group.
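
To see which alternative is responsible for a given match, the two branches can be tested separately on single tokens. This is only an illustrative sketch in base R: the `tokens` vector and the variable names are made up for the example, and `^`/`$` stand in for the whitespace boundary because each token is tested on its own.

```r
# Tokens taken from the example strings in the question
tokens <- c("123AB123", "Electrical", "CDe", "FG123-4",
            "12/1/17", "ABCD", "today", "ab-cd1")

# Branch 1: the token contains at least one letter AND at least one digit
has_letter_and_digit <- grepl("^(?=\\S*\\p{L})(?=\\S*\\d)\\S+$", tokens, perl = TRUE)

# Branch 2: the token contains at least two uppercase letters
has_two_upper <- grepl("^(?:\\S*\\p{Lu}){2}\\S*$", tokens, perl = TRUE)

# A token is kept if either branch matches
kept <- has_letter_and_digit | has_two_upper

data.frame(token = tokens, has_letter_and_digit, has_two_upper, kept)
```

Tokens such as "12/1/17" fail both branches (no letter, no uppercase), which is why date strings are skipped, while "CDe" is kept by the second branch alone.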



To join the matches found in each input string into a single string, you may use

    unlist(lapply(res, function(m) paste(m, collapse = " ")))


See an online R demo.

Output:

    [1] "123AB123 CDe FG123-4" "ABCD A123B"           "XY1234 PQRS1"        
    [4] "H123a1 ab-cd1"        "al1234 al5678"
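
If you need this in more than one place, the extraction and the joining step can be wrapped in one small base R function. The sketch below is an assumption-laden convenience, not part of the original solution: `extract_codes`, `default_pattern` and `fast_pattern` are hypothetical names, and `fast_pattern` simply applies the negated-character-class replacements mentioned in the explanation above.

```r
# Pattern from the answer: words with a letter and a digit,
# or words with two or more uppercase letters
default_pattern <- "(?<!\\S)(?:(?=\\S*\\p{L})(?=\\S*\\d)\\S+|(?:\\S*\\p{Lu}){2}\\S*)"

# Hypothetical helper: extract the matching words from each string
# and join them with a single space (returns "" when nothing matches)
extract_codes <- function(x, pattern = default_pattern) {
  matches <- regmatches(x, gregexpr(pattern, x, perl = TRUE))
  vapply(matches, paste, character(1), collapse = " ")
}

x <- c("123AB123 Electrical CDe FG123-4 ...",
       "12/1/17 ABCD How are you today A123B",
       "20.9.12 Eat / Drink XY1234 for PQRS1",
       "Going home H123a1 ab-cd1",
       "Change channel for al1234 to al5678")

result <- extract_codes(x)
print(result)

# Same pattern with \S* replaced by negated character classes
fast_pattern <- "(?<!\\S)(?:(?=[^\\s\\p{L}]*\\p{L})(?=[^\\s\\d]*\\d)\\S+|(?:[^\\s\\p{Lu}]*\\p{Lu}){2}\\S*)"
same <- identical(result, extract_codes(x, fast_pattern))
print(same)
```

Using `vapply` with `character(1)` guarantees one string per input element, so the result can be assigned straight back to a data frame column.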
