RxT

Reputation: 546

Extract uppercase words till the first lowercase letter

I need to extract the leading part of a text that is all uppercase, up to the first lowercase letter.

For example, I have the text: "IV LONG TEXT HERE and now the Text End HERE"

I want to extract "IV LONG TEXT HERE".

I have been trying something like this:

text <- "IV LONG TEXT HERE and now the Text End HERE"

stringr::str_extract_all(text, "[A-Z]")

but I'm failing at the regex.

Upvotes: 0

Views: 553

Answers (3)

Syntax Error

Reputation: 72

The code sample below should work.

text <- "IV LONG TEXT HERE and now the Text End HERE"

stringr::str_extract_all(text, "\\w.*[A-Z] \\b")

Output :

[[1]]
[1] "IV LONG TEXT HERE "

Interpretation :

Match a word character (\w), then any characters zero or more times (.*, greedy), backtracking to the last uppercase letter ([A-Z]) that is followed by a space and then a word boundary (\b). Note that the match keeps the trailing space.
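As a quick check of how this pattern behaves, the sketch below runs it on the original text and on a variant. The second string, `text2`, is a made-up input for illustration and is not from the question. Because `.*` is greedy, the match runs up to the last uppercase letter that is followed by a space, so it can swallow lowercase words.

```r
library(stringr)

pattern <- "\\w.*[A-Z] \\b"

text  <- "IV LONG TEXT HERE and now the Text End HERE"
# Hypothetical variant: an uppercase word followed by a space appears later on
text2 <- "IV LONG TEXT HERE and now the Text END here"

res1 <- str_extract(text, pattern)
res2 <- str_extract(text2, pattern)

print(res1)  # the trailing space is part of the match
print(res2)  # the greedy .* extends the match up to "END "
```

If inputs like `text2` are possible, one of the stricter patterns in the other answers is probably a safer choice.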

Upvotes: 1

The fourth bird

Reputation: 163632

You could use str_extract with a pattern that matches a single uppercase char, optionally followed by uppercase chars and spaces that end with another uppercase char.

\b[A-Z](?:[A-Z ]*[A-Z])?\b

Explanation

  • \b[A-Z] A word boundary to prevent a partial word match, then match a single char A-Z
  • (?: Non capture group to match as a whole
    • [A-Z ]*[A-Z] Match zero or more chars A-Z or spaces, then a single char A-Z
  • )? Close the non capture group and make it optional
  • \b A word boundary

Example

text <- "IV LONG TEXT HERE and now the Text End HERE"

stringr::str_extract(text, "\\b[A-Z](?:[A-Z ]*[A-Z])?\\b")

Output

[1] "IV LONG TEXT HERE"
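To see how the pattern behaves on other inputs, here is a small sketch over a character vector. The extra `inputs` values and the anchored variant `anchored` are assumptions added for illustration, not part of the answer above. Since `str_extract` returns the first match anywhere in the string, replacing the leading `\\b` with `^` may be closer to the question's intent when the uppercase run must sit at the very start.

```r
library(stringr)

# Illustrative inputs (hypothetical, apart from the first one)
inputs <- c(
  "IV LONG TEXT HERE and now the Text End HERE",
  "no caps at START",
  "ALL CAPS"
)

pattern  <- "\\b[A-Z](?:[A-Z ]*[A-Z])?\\b"  # pattern from the answer
anchored <- "^[A-Z](?:[A-Z ]*[A-Z])?\\b"    # assumed variant: match only at the start

unanchored_res <- str_extract(inputs, pattern)
anchored_res   <- str_extract(inputs, anchored)

print(unanchored_res)
print(anchored_res)
```

With these inputs, the unanchored pattern picks up `START` from the second string, while the anchored variant returns `NA` for it.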

Upvotes: 1

akrun

Reputation: 887971

Instead of str_extract, use str_replace or str_remove.

library(stringr)
text <- "IV LONG TEXT HERE and now the Text End HERE"
# match one or more whitespace characters (\\s+) followed by
# one or more lowercase letters ([a-z]+) and the rest of the string (.*),
# then remove the matched characters
str_remove(text, "\\s+[a-z]+.*")
# [1] "IV LONG TEXT HERE"
# or capture one or more uppercase letters or spaces as a group, ([A-Z ]+),
# followed by one or more whitespace characters (\\s+) and the rest of the
# string (.*), then replace the match with the backreference (\\1)
str_replace(text, "([A-Z ]+)\\s+.*", "\\1")
# [1] "IV LONG TEXT HERE"
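If stringr is not available, the same two ideas can be sketched in base R with `sub()`. This is an assumed equivalent rather than part of the answer above; `perl = TRUE` is set so that the patterns are handled by the PCRE engine.

```r
text <- "IV LONG TEXT HERE and now the Text End HERE"

# Remove everything from the first whitespace + lowercase run onwards
removed <- sub("\\s+[a-z]+.*", "", text, perl = TRUE)

# Keep only the captured uppercase/space run via the backreference
replaced <- sub("([A-Z ]+)\\s+.*", "\\1", text, perl = TRUE)

print(removed)
print(replaced)
```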

Upvotes: 1
