ILoveHateR
ILoveHateR

Reputation: 29

change string to simpler text using str_detect and dplyr

I have a dataset with many long complicated names that I am trying to clean up. I want to for example change Coal|w/ CCS to just Coal CCS and Coal|w/o CCS to just Coal. I have tried the following code to clean it up.

Pr_ene<- Pr_en %>% 
  mutate(Variable = case_when(
    str_detect(Variable, "\bPrimary Energy|Coal|w/ CCS\b") ~ "Coal CCS",
    str_detect(Variable, "\bPrimary Energy|Coal|w/o CCS\b") ~ "Coal",
    str_detect(Variable, "\bPrimary Energy|Gas|w/ CCS\b") ~ "Gas CCS",
    str_detect(Variable, "\bPrimary Energy|Gas|w/o CCS\b") ~ "Gas",
    str_detect(Variable, "Nuclear") ~ "Nuclear",         ))

But I end up with something looking like this

 Scenario Variable   X2005   X2010   X2020   X2030 X2040  X2050   X2060   X2070  X2080   X2090   X2100
   <chr>    <chr>      <dbl>   <dbl>   <dbl>   <dbl> <dbl>  <dbl>   <dbl>   <dbl>  <dbl>   <dbl>   <dbl>
 1 xxx1-00  Coal CCS   0       0       0       9.82  22.2  38.2    32.1    15.5     3.92   0.771   0.04 
 2 xxx2-00  Coal CCS 121.    140.    136.     25.4    1.78  0.146   0.054   0.032   0.02   0.012   0.007
 3 xxx3-00  Gas CCS    0       0       0      12.6   50.0  53.1    33.3    14.4     4.03   0.004   0.002
 4 xxx4-00  Gas CCS  100.    106.    122.    127.    89.6  74.7    63.9    54.5    52.3   50.1    52.8  
 5 xxx5-00  Nuclear   

Original dataset:

 Scenario                              Variable    X2005    X2010    X2020    X2030   X2040    X2050    X2060    X2070
1   xxx1-00            Primary Energy|Coal|w/ CCS   0.0000   0.0000   0.0000   9.8220  22.164  38.1680  32.1030  15.4880
2   xxx2-00           Primary Energy|Coal|w/o CCS 121.2930 140.0090 136.4170  25.4090   1.781   0.1460   0.0540   0.0320
3   xxx3-00             Primary Energy|Gas|w/ CCS   0.0000   0.0000   0.0000  12.6250  50.007  53.0870  33.2860  14.4470
4   xxx4-00            Primary Energy|Gas|w/o CCS 100.4820 105.5650 122.0080 127.3080  89.555  74.7170  63.8960  54.4620
5   xxx5-00                Primary Energy|Nuclear   9.9660 

The result should be for scenario 1 Coal CCS, 2 Coal, 3 Gas CCS, 4 Gas 5 Nuclear I have also looked through similar questions but I have not been able to fix my problem... I appreciate anyone's help, thanks in advance!

Upvotes: 0

Views: 439

Answers (2)

Gregor Thomas
Gregor Thomas

Reputation: 145765

Regex (like you're using when you use str_detect) is for looking within strings for patterns. You are mostly doing exact matching - find the entire string "Primary Energy|Coal|w/ CCS", replace it with "Coal CCS". For this, we don't need regex because we are looking and modifying whole strings, not parts of strings.

I would use a look-up table and a join:

energy_lookup = tribble(
  ~Variable, ~Result,
  "Primary Energy|Coal|w/ CCS", "Coal CCS",
  "Primary Energy|Coal|w/o CCS", "Coal",
  "Primary Energy|Gas|w/ CCS", "Gas CCS",
  "Primary Energy|Gas|w/o CCS", "Gas",
  "Primary Energy|Nuclear", "Nuclear"
)


Pr_en %>% left_join(energy_lookup, by = "Variable")

If you had a lot more categories and did want to use regex to generalize patterns, I would probably do it in two stages - 1st extract the word between |, 2nd paste on CCS if the "w/ CCS" is part of the string:

Pr_en %>%
  mutate(
    result = str_extract(Variable, pattern = "(?<=\\|)[^|]*"),
    result = case_when(
      str_detect(Variable, "w/ CCS") ~ paste(result, "CCS"),
      TRUE ~ result
    )
  )

Explanation of the pattern - "(?<=\\|)[^|]*"

  • (?<=\\|) is a look-behind assertion. It is not part of the match, but it ensures the match is preceded by a |. (And because | is a metacharacter for regex, we need to escape it with \\ unless it appears in brackets)
  • [^|]* matches any number of non-| characters. ^ (in brackets) means "not", and * means "any number".

Upvotes: 1

eastclintw00d
eastclintw00d

Reputation: 2364

The | has a special meaning in regular expressions. This will work:

library(dplyr)
library(stringr)
Pr_en <- tibble(
  Variable = c("Primary Energy|Coal|w/ CCS", "Primary Energy|Coal|w/o CCS", "Primary Energy|Gas|w/ CCS", "Primary Energy|Gas|w/o CCS", "Primary Energy|Nuclear")
)

Pr_ene<- Pr_en %>% 
  mutate(
    Variable = case_when(
      str_detect(Variable, "Coal.w/ CCS") ~ "Coal CCS",
      str_detect(Variable, "Coal.w/o CCS") ~ "Coal",
      str_detect(Variable, "Gas.w/ CCS") ~ "Gas CCS",
      str_detect(Variable, "Gas.w/o CCS") ~ "Gas",
      str_detect(Variable, "Nuclear") ~ "Nuclear",
    )
  )
Pr_ene

Here, . is a placeholder.

Upvotes: 1

Related Questions