Reputation: 99361
Take the following character vector x
x <- c("1 Date in the form", "2 Number of game",
"3 Day of week", "4-5 Visiting team and league")
My desired result is the following vector, containing the first capitalized word from each string and, if the string contains a "-", also the last word.
[1] "Date" "Number" "Day" "Visiting" "league"
So instead of doing
unlist(sapply(strsplit(x, "[[:blank:]]+|, "), function(y){
if(grepl("[-]", y[1])) c(y[2], tail(y,1)) else y[2]
}))
to get the result, I figured I could try to shorten it to a regular expression. The result I want is almost the "opposite" of what this regular expression does in sub(). I've tried every way I can think of to get the opposite, with different varieties of [^A-Za-z]+ among others, and haven't been successful.
> sub("[A-Z][a-z]+", "", x)
[1] "1 in the form" "2 of game"
[3] "3 of week" "4-5 team and league"
So I guess this is a two-part question:
1. With sub() or gsub(), how can I return the opposite of "[A-Z][a-z]+"?
2. How can I write the regular expression to read like "Match the first capitalized word and, if the string contains a -, also match the last word."?
Upvotes: 2
Views: 658
Reputation: 1866
Here is a solution using three regular expressions.
cap_words <- regmatches(x, regexpr("[A-Z][a-z]+", x)) # capitalised word
last_words <- sub(".*\\s", "", x[grep("-", x)]) # get last word in strings with a dash
c(cap_words, last_words)
# [1] "Date" "Number" "Day" "Visiting" "league"
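To make the first line above easier to follow, here is a minimal sketch of what regexpr() and regmatches() do. The vector y and the variable names are made up for illustration. Note that regmatches() drops strings that contain no match, so the result can be shorter than the input.

```r
x <- c("1 Date in the form", "2 Number of game",
       "3 Day of week", "4-5 Visiting team and league")

# regexpr() locates only the first match in each string
m <- regexpr("[A-Z][a-z]+", x)
starts    <- as.integer(m)            # start positions: 3 3 3 5
match_len <- attr(m, "match.length")  # match lengths:   4 6 3 8

# regmatches() returns the matched text and silently drops non-matching strings
y <- c("no capitals here", "One capital")   # hypothetical input
only_matches <- regmatches(y, regexpr("[A-Z][a-z]+", y))
only_matches
# [1] "One"
```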
Upvotes: 2
Reputation: 81733
Here are some suggestions:
To extract the first capitalized word with sub(), you can use
sub(".*\\b([A-Z].*?)\\b.*", "\\1", x)
#[1] "Date" "Number" "Day" "Visiting"
where \\b represents a word boundary.
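One caveat with the capture-group approach, sketched here with a slightly stricter variant of the pattern ([a-z]+ instead of .*?, and perl = TRUE, both my own choices) and a made-up input vector z: when the pattern does not match, sub() returns the string unchanged, so non-matches have to be filtered separately.

```r
z <- c("3 Day of week", "no capital letters")   # hypothetical input

# The whole string is replaced by the captured word, but only where the pattern matches
out <- sub(".*\\b([A-Z][a-z]+)\\b.*", "\\1", z, perl = TRUE)
out
# [1] "Day"                "no capital letters"

# Keep only the elements that actually contained a capitalized word
matched <- out[grepl("\\b[A-Z][a-z]+\\b", z, perl = TRUE)]
matched
# [1] "Day"
```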
You can also extract all words with a single sub() command, but note that an extra step is needed, because the vector returned by sub() always has the same length as the input vector x.
The following regular expression uses a lookahead ((?=.*-)) to test whether there is a - in the string. If there is, two words are extracted. If there is not, the alternative after the logical or (|) is applied and returns the first capitalized word only.
res <- sub("(?:(?=.*-).*\\b([A-Z].*?\\b ).*\\b(.+)$)|(?:.*\\b([A-Z].*?)\\b.*)",
"\\1\\2\\3", x, perl = TRUE)
# [1] "Date" "Number" "Day" "Visiting league"
One additional step is necessary in order to separate multiple words in the same string:
unlist(strsplit(res, " ", fixed = TRUE))
# [1] "Date" "Number" "Day" "Visiting" "league"
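As a small illustration of the lookahead used above (a sketch, not part of the original solution; the names has_dash and marked are made up), (?=.*-) only checks that a - occurs somewhere ahead and consumes no characters itself:

```r
x <- c("1 Date in the form", "2 Number of game",
       "3 Day of week", "4-5 Visiting team and league")

# The lookahead succeeds only for strings that contain a dash
has_dash <- grepl("^(?=.*-)", x, perl = TRUE)
has_dash
# [1] FALSE FALSE FALSE  TRUE

# It is zero-width: replacing the match inserts text without removing anything
marked <- sub("^(?=.*-)", ">", x, perl = TRUE)
marked[4]
# [1] ">4-5 Visiting team and league"
```

Because the lookahead matches nothing by itself, the rest of the first alternative can still match from the start of the string.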
Upvotes: 3