Reputation: 29
I have a dataset with many long complicated names that I am trying to clean up. I want to for example change Coal|w/ CCS
to just Coal CCS
and Coal|w/o CCS
to just Coal
.
I have tried the following code to clean it up.
Pr_ene<- Pr_en %>%
mutate(Variable = case_when(
str_detect(Variable, "\bPrimary Energy|Coal|w/ CCS\b") ~ "Coal CCS",
str_detect(Variable, "\bPrimary Energy|Coal|w/o CCS\b") ~ "Coal",
str_detect(Variable, "\bPrimary Energy|Gas|w/ CCS\b") ~ "Gas CCS",
str_detect(Variable, "\bPrimary Energy|Gas|w/o CCS\b") ~ "Gas",
str_detect(Variable, "Nuclear") ~ "Nuclear", ))
But I end up with something looking like this
Scenario Variable X2005 X2010 X2020 X2030 X2040 X2050 X2060 X2070 X2080 X2090 X2100
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 xxx1-00 Coal CCS 0 0 0 9.82 22.2 38.2 32.1 15.5 3.92 0.771 0.04
2 xxx2-00 Coal CCS 121. 140. 136. 25.4 1.78 0.146 0.054 0.032 0.02 0.012 0.007
3 xxx3-00 Gas CCS 0 0 0 12.6 50.0 53.1 33.3 14.4 4.03 0.004 0.002
4 xxx4-00 Gas CCS 100. 106. 122. 127. 89.6 74.7 63.9 54.5 52.3 50.1 52.8
5 xxx5-00 Nuclear
Original dataset:
Scenario Variable X2005 X2010 X2020 X2030 X2040 X2050 X2060 X2070
1 xxx1-00 Primary Energy|Coal|w/ CCS 0.0000 0.0000 0.0000 9.8220 22.164 38.1680 32.1030 15.4880
2 xxx2-00 Primary Energy|Coal|w/o CCS 121.2930 140.0090 136.4170 25.4090 1.781 0.1460 0.0540 0.0320
3 xxx3-00 Primary Energy|Gas|w/ CCS 0.0000 0.0000 0.0000 12.6250 50.007 53.0870 33.2860 14.4470
4 xxx4-00 Primary Energy|Gas|w/o CCS 100.4820 105.5650 122.0080 127.3080 89.555 74.7170 63.8960 54.4620
5 xxx5-00 Primary Energy|Nuclear 9.9660
The result should be for scenario 1 Coal CCS, 2 Coal, 3 Gas CCS, 4 Gas 5 Nuclear I have also looked through similar questions but I have not been able to fix my problem... I appreciate anyone's help, thanks in advance!
Upvotes: 0
Views: 439
Reputation: 145765
Regex (like you're using when you use str_detect
) is for looking within strings for patterns. You are mostly doing exact matching - find the entire string "Primary Energy|Coal|w/ CCS"
, replace it with "Coal CCS"
. For this, we don't need regex because we are looking and modifying whole strings, not parts of strings.
I would use a look-up table and a join:
energy_lookup = tribble(
~Variable, ~Result,
"Primary Energy|Coal|w/ CCS", "Coal CCS",
"Primary Energy|Coal|w/o CCS", "Coal",
"Primary Energy|Gas|w/ CCS", "Gas CCS",
"Primary Energy|Gas|w/o CCS", "Gas",
"Primary Energy|Nuclear", "Nuclear"
)
Pr_en %>% left_join(energy_lookup, by = "Variable")
If you had a lot more categories and did want to use regex to generalize patterns, I would probably do it in two stages - 1st extract the word between |
, 2nd paste on CCS
if the "w/ CCS"
is part of the string:
Pr_en %>%
mutate(
result = str_extract(Variable, pattern = "(?<=\\|)[^|]*"),
result = case_when(
str_detect(Variable, "w/ CCS") ~ paste(result, "CCS"),
TRUE ~ result
)
)
Explanation of the pattern - "(?<=\\|)[^|]*"
(?<=\\|)
is a look-behind assertion. It is not part of the match, but it ensures the match is preceded by a |
. (And because |
is a metacharacter for regex, we need to escape it with \\
unless it appears in brackets)[^|]*
matches any number of non-|
characters. ^
(in brackets) means "not", and *
means "any number".Upvotes: 1
Reputation: 2364
The |
has a special meaning in regular expressions. This will work:
library(dplyr)
library(stringr)
Pr_en <- tibble(
Variable = c("Primary Energy|Coal|w/ CCS", "Primary Energy|Coal|w/o CCS", "Primary Energy|Gas|w/ CCS", "Primary Energy|Gas|w/o CCS", "Primary Energy|Nuclear")
)
Pr_ene<- Pr_en %>%
mutate(
Variable = case_when(
str_detect(Variable, "Coal.w/ CCS") ~ "Coal CCS",
str_detect(Variable, "Coal.w/o CCS") ~ "Coal",
str_detect(Variable, "Gas.w/ CCS") ~ "Gas CCS",
str_detect(Variable, "Gas.w/o CCS") ~ "Gas",
str_detect(Variable, "Nuclear") ~ "Nuclear",
)
)
Pr_ene
Here, .
is a placeholder.
Upvotes: 1