MustardRecord

Reputation: 305

R: extract substring with capital letters from string

I have a dataframe with strings in a column. How could I extract only the substrings that are in capital letters and add them to another column?

This is an example:

    fecha          incident
1   2020-12-01     Check GENERATOR
2   2020-12-01     Check BLADE
3   2020-12-02     Problem in GENERATOR
4   2020-12-01     Check YAW
5   2020-12-02     Alarm in SAFETY SYSTEM

And I would like to create another column as follows:

    fecha          incident                  system
1   2020-12-01     Check GENERATOR           GENERATOR
2   2020-12-01     Check BLADE               BLADE
3   2020-12-02     Problem in GENERATOR      GENERATOR
4   2020-12-01     Check YAW                 YAW
5   2020-12-02     Alarm in SAFETY SYSTEM    SAFETY SYSTEM

I have tried str_sub and str_extract_all with a regex, but I believe I'm doing things wrong.

Upvotes: 4

Views: 5116

Answers (3)

Ronak Shah

Reputation: 389325

If the upper-case words are not always next to each other, you can use str_extract_all to extract every run of two or more capital letters in each string and then paste the runs together.

sapply(stringr::str_extract_all(df$incident, '[A-Z]{2,}'),paste0, collapse = ' ')
#[1] "GENERATOR"  "BLADE"    "GENERATOR"     "YAW"     "SAFETY SYSTEM"
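To store the result as a new column, the same call can be assigned back to the data frame. This is only a minimal sketch: the data frame `df` is rebuilt by hand from the example in the question, and the column name `system` is taken from the desired output shown there.

```r
library(stringr)

# Assumed reconstruction of the question's data frame
df <- data.frame(
  fecha = c("2020-12-01", "2020-12-01", "2020-12-02", "2020-12-01", "2020-12-02"),
  incident = c("Check GENERATOR", "Check BLADE", "Problem in GENERATOR",
               "Check YAW", "Alarm in SAFETY SYSTEM")
)

# Extract every run of 2+ capital letters per row,
# then join the runs of each row with a single space
df$system <- sapply(str_extract_all(df$incident, "[A-Z]{2,}"),
                    paste, collapse = " ")

df$system
```

Requiring at least two letters (`{2,}`) is what keeps the initial capital of "Check", "Problem" and "Alarm" out of the result.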

Upvotes: 0

Mario Niepel

Reputation: 1175

You can use str_extract if you want to work in a dataframe and tie it into a tidyverse workflow.

The regex matches a run of two or more consecutive characters that are each either an upper-case letter or whitespace, so a word with only an initial capital is not matched. str_trim removes the whitespace that gets picked up around the match when the upper-case word is not at the start or end of the string.

Note that this code snippet only extracts the first run of upper-case words connected by spaces. If there are upper-case words in different parts of the string, only the first one is returned.

library(tidyverse)
x <- c("CAPITAL and not Capital", "one more CAP word", "MULTIPLE CAPITAL words", "CAP words NOT connected")
cap <- str_trim(str_extract(x, "([:upper:]|[:space:]){2,}"))
cap
#> [1] "CAPITAL"          "CAP"              "MULTIPLE CAPITAL" "CAP"

Created on 2021-01-08 by the reprex package (v0.3.0)
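Applied to the data from the question, the same regex could be used inside mutate. This is a sketch under assumptions: the data frame `df` is rebuilt by hand from the question's example, and `system` is the column name from the desired output.

```r
library(dplyr)
library(stringr)

# Assumed reconstruction of the question's data frame
df <- data.frame(
  fecha = c("2020-12-01", "2020-12-01", "2020-12-02", "2020-12-01", "2020-12-02"),
  incident = c("Check GENERATOR", "Check BLADE", "Problem in GENERATOR",
               "Check YAW", "Alarm in SAFETY SYSTEM")
)

# First run of 2+ upper-case letters or whitespace, trimmed of surrounding spaces
df <- df %>%
  mutate(system = str_trim(str_extract(incident, "([:upper:]|[:space:]){2,}")))

df$system
```

Because the space counts toward the run, "SAFETY SYSTEM" comes back as a single match, which is what the question asks for.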

Upvotes: 7

Dr. Flow

Reputation: 486

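library(tidyverse)

# One-row data frame to test on
string <- data.frame(test = "does this WORK")

# Extract every run of capital letters (str_extract_all returns a list)
string$new <- str_extract_all(string$test, "[A-Z]+")

string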

           test  new
1 does this WORK WORK
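One detail worth checking in this approach is the type of the new column. The sketch below compares str_extract_all with str_extract on the same one-row data frame; the column names `new_all` and `new_first` are made up for the comparison.

```r
library(stringr)

string <- data.frame(test = "does this WORK")

# str_extract_all() returns a list, so this creates a list column
string$new_all <- str_extract_all(string$test, "[A-Z]+")

# str_extract() returns a character vector holding only the first match
string$new_first <- str_extract(string$test, "[A-Z]+")

class(string$new_all)
class(string$new_first)
```

Note also that "[A-Z]+" matches a single capital letter, so on a string such as "Check GENERATOR" the first match is the "C" of "Check". A pattern like "[A-Z]{2,}" avoids that.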

Upvotes: 0
