Reputation: 41
I am new to R programming and want to extract alphanumeric words as well as words containing more than one uppercase letter.
Below is an example of the string and my desired output for it.
x <- c("123AB123 Electrical CDe FG123-4 ...",
"12/1/17 ABCD How are you today A123B",
"20.9.12 Eat / Drink XY1234 for PQRS1",
"Going home H123a1 ab-cd1",
"Change channel for al1234 to al5678")
#Desired Output
#[1] "123AB123 CDe FG123-4" "ABCD A123B" "XY1234 PQRS1"
#[4] "H123a1 ab-cd1" "al1234 al5678"
I have come across 2 separate solutions so far on Stack Overflow:
how-to-count-capslock-in-string-using-r
library(stringr)
str_count(x, "\\b[A-Z]{2,}\\b")
That code counts how many times each string contains a word with more than one uppercase letter, but I want to extract those words, along with the alphanumeric words.
Forgive me if my question or research is not comprehensive enough. I will post my researched solution for extracting all words containing a number in 12 hours, when I have access to my workstation, which contains R and the dataset.
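For reference, the gap described above can be reproduced with a minimal sketch, assuming the stringr package is installed and using the sample x from this question. str_count() only returns counts, while str_extract_all() with the same pattern returns the matching words. The variable names are illustrative.

```r
library(stringr)

x <- c("123AB123 Electrical CDe FG123-4 ...",
       "12/1/17 ABCD How are you today A123B",
       "20.9.12 Eat / Drink XY1234 for PQRS1",
       "Going home H123a1 ab-cd1",
       "Change channel for al1234 to al5678")

# Pattern quoted above: whole words made of 2+ uppercase ASCII letters
pattern <- "\\b[A-Z]{2,}\\b"

# str_count() gives one count per string
counts <- str_count(x, pattern)

# str_extract_all() gives the matching words, one list element per string
words <- str_extract_all(x, pattern)

print(counts)
print(unlist(words))
```

This pattern only matches whole words made entirely of uppercase letters, so on the sample it finds ABCD but misses mixed words such as CDe and all of the alphanumeric words.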
Upvotes: 3
Views: 1172
Reputation: 627536
A single regex solution will also work:
> res <- str_extract_all(x, "(?<!\\S)(?:(?=\\S*\\p{L})(?=\\S*\\d)\\S+|(?:\\S*\\p{Lu}){2}\\S*)")
> unlist(res)
[1] "123AB123" "CDe" "FG123-4" "ABCD" "A123B" "XY1234"
[7] "PQRS1" "H123a1" "ab-cd1" "al1234" "al5678"
This will also work with regmatches in base R using the PCRE regex engine:
> res2 <- regmatches(x, gregexpr("(?<!\\S)(?:(?=\\S*\\p{L})(?=\\S*\\d)\\S+|(?:\\S*\\p{Lu}){2}\\S*)", x, perl=TRUE))
> unlist(res2)
[1] "123AB123" "CDe" "FG123-4" "ABCD" "A123B" "XY1234"
[7] "PQRS1" "H123a1" "ab-cd1" "al1234" "al5678"
Why does it work?
(?<!\\S) - matches a position at the start of the string or right after a whitespace character
(?: - start of a non-capturing group that has two alternative patterns defined:
  (?=\\S*\\p{L})(?=\\S*\\d)\\S+
    (?=\\S*\\p{L}) - make sure there is a letter after 0+ non-whitespace chars (for better performance, replace \\S* with [^\\s\\p{L}]*)
    (?=\\S*\\d) - make sure there is a digit after 0+ non-whitespace chars (for better performance, replace \\S* with [^\\s\\d]*)
    \\S+ - match 1 or more non-whitespace chars
  | - or
  (?:\\S*\\p{Lu}){2}\\S*
    (?:\\S*\\p{Lu}){2} - 2 occurrences of 0+ non-whitespace chars (\\S*; for better performance, replace with [^\\s\\p{Lu}]*) followed by 1 uppercase letter (\\p{Lu})
    \\S* - 0+ non-whitespace chars
) - end of the non-capturing group.
To join the matches found in each element of the character vector, you may use
unlist(lapply(res, function(c) paste(unlist(c), collapse=" ")))
See an online R demo.
Output:
[1] "123AB123 CDe FG123-4" "ABCD A123B" "XY1234 PQRS1"
[4] "H123a1 ab-cd1" "al1234 al5678"
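To see what each branch of the pattern contributes, here is a minimal base R sketch that runs the two alternatives separately, then checks that the negated-class variant suggested in the explanation returns the same matches on the sample data. The helper extract_all and the variable names are illustrative and not part of the original answer; x is the vector from the question.

```r
x <- c("123AB123 Electrical CDe FG123-4 ...",
       "12/1/17 ABCD How are you today A123B",
       "20.9.12 Eat / Drink XY1234 for PQRS1",
       "Going home H123a1 ab-cd1",
       "Change channel for al1234 to al5678")

# Hypothetical helper: all PCRE matches of `pattern`, flattened into one vector
extract_all <- function(pattern, text) {
  unlist(regmatches(text, gregexpr(pattern, text, perl = TRUE)))
}

# Branch 1 alone: tokens containing at least one letter and at least one digit
alnum_hits <- extract_all("(?<!\\S)(?=\\S*\\p{L})(?=\\S*\\d)\\S+", x)

# Branch 2 alone: tokens containing at least two uppercase letters
upper_hits <- extract_all("(?<!\\S)(?:\\S*\\p{Lu}){2}\\S*", x)

# Full pattern from the answer, and the variant with negated character classes
original <- "(?<!\\S)(?:(?=\\S*\\p{L})(?=\\S*\\d)\\S+|(?:\\S*\\p{Lu}){2}\\S*)"
variant <- "(?<!\\S)(?:(?=[^\\s\\p{L}]*\\p{L})(?=[^\\s\\d]*\\d)\\S+|(?:[^\\s\\p{Lu}]*\\p{Lu}){2}\\S*)"
same_matches <- identical(extract_all(original, x), extract_all(variant, x))

print(alnum_hits)
print(upper_hits)
print(same_matches)
```

On the sample data, only the second branch picks up CDe and ABCD, and only the first picks up H123a1, ab-cd1, al1234 and al5678, which is why both alternatives are needed.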
Upvotes: 2
Reputation: 4083
This works:
library(stringr)
# split words from strings into one-word-per element vector
y <- unlist(str_split(x, ' '))
# find strings with at least 2 uppercase
uppers <- str_count(y, '[A-Z]')>1
# find strings with at least 1 letter
alphas <- str_detect(y, '[:alpha:]')
# find strings with at least 1 number
nums <- str_detect(y, '[:digit:]')
# subset vector to those that have 2 uppercase OR a letter AND a number
y[uppers | (alphas & nums)]
[1] "123AB123" "CDe" "FG123-4" "ABCD" "A123B" "XY1234"
[7] "PQRS1" "H123a1" "ab-cd1" "al1234" "al5678"
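The result above is a single flat vector, whereas the question asks for one string per input element. A minimal base R sketch of the same word-by-word test, applied to each element separately and rejoined with spaces, could look like this. keep_words is a hypothetical helper name, and x is the vector from the question.

```r
x <- c("123AB123 Electrical CDe FG123-4 ...",
       "12/1/17 ABCD How are you today A123B",
       "20.9.12 Eat / Drink XY1234 for PQRS1",
       "Going home H123a1 ab-cd1",
       "Change channel for al1234 to al5678")

# Hypothetical helper: keep the qualifying words of one string and rejoin them
keep_words <- function(s) {
  # split the string into words on single spaces
  w <- strsplit(s, " ", fixed = TRUE)[[1]]
  # at least 2 uppercase letters: delete everything else and count what remains
  uppers <- nchar(gsub("[^[:upper:]]", "", w)) > 1
  # at least 1 letter
  alphas <- grepl("[[:alpha:]]", w)
  # at least 1 digit
  nums <- grepl("[[:digit:]]", w)
  # 2 uppercase letters OR (a letter AND a digit)
  paste(w[uppers | (alphas & nums)], collapse = " ")
}

# One output string per input string
out <- vapply(x, keep_words, character(1), USE.NAMES = FALSE)
print(out)
```

Using vapply with character(1) guarantees exactly one string per input, so an element with no qualifying words comes back as an empty string rather than being dropped.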
Upvotes: 2