Reputation: 763
I have the following column in my data frame that contains charges:
library(dplyr)
library(stringr)
df<-data.frame(charge=c("trespass-1st degree",
"trespass - 1st degree","rape or attempted rape - 1st degree",
"rape or attempt rape 1st degree","Assault 1st","Assault 1st"))
charge
1 trespass-1st degree
2 trespass - 1st degree
3 rape or attempted rape - 1st degree
4 rape or attempt rape 1st degree
5 Assault 1st
6 Assault 1st
I want to standardize charges that differ only because of data entry inconsistencies, e.g.
trespass-1st degree vs. trespass - 1st degree
and
rape or attempted rape - 1st degree vs. rape or attempt rape 1st degree
I tried the following
df%>%
mutate(charge=
case_when(str_detect(charge, "^trespass-1st") ~ "Trespass 1st",
str_detect(charge,"^rape or attempted rape")~"Rape 1st"))
which gives me the following output
charge
1 Trespass 1st
2 <NA>
3 Rape 1st
4 <NA>
5 <NA>
6 <NA>
How do I make sure that if both "trespass" and "1st" are present in the charge column, the row gets tagged as "Trespass 1st", and if both "rape" and "1st" are present, it gets tagged as "Rape 1st"? All other charges should stay as they are, to get the following df:
charge
1 Trespass 1st
2 Trespass 1st
3 Rape 1st
4 Rape 1st
5 Assault 1st
6 Assault 1st
Upvotes: 1
Views: 629
Reputation: 887028
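The issue is that the entries are inconsistent: some elements have no spaces around the hyphen while others do (trespass-1st vs trespass - 1st), and some have a suffix that others lack (attempt vs attempted). The patterns need to allow for both variants, and a final TRUE ~ charge keeps the rows that match neither pattern instead of turning them into NA.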
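library(dplyr)
library(stringr)  # provides str_detect()
df %>%
  mutate(charge = case_when(
    # optional whitespace on either side of the hyphen
    str_detect(charge, "^trespass\\s*-\\s*1st") ~ "Trespass 1st",
    # "attempt" with an optional "ed" suffix
    str_detect(charge, "^rape or attempt(ed)? rape") ~ "Rape 1st",
    # leave every other charge unchanged
    TRUE ~ charge))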
# charge
#1 Trespass 1st
#2 Trespass 1st
#3 Rape 1st
#4 Rape 1st
#5 Assault 1st
#6 Assault 1st
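If you would rather test for the two keywords anywhere in the string, which is closer to how the question is worded, one possible sketch is below. Note that has_all() is a hypothetical helper written for this illustration, not a dplyr or stringr function. It treats each keyword as a case-insensitive regular expression, so a short keyword such as "rape" would also match inside a longer word like "grape"; adding word boundaries (e.g. "\\brape\\b") may be worth considering on real data.

```r
library(dplyr)
library(stringr)

df <- data.frame(charge = c("trespass-1st degree", "trespass - 1st degree",
                            "rape or attempted rape - 1st degree",
                            "rape or attempt rape 1st degree",
                            "Assault 1st", "Assault 1st"))

# Hypothetical helper (not part of any package): returns TRUE for each
# element of x that contains every keyword, ignoring case.
has_all <- function(x, keywords) {
  hits <- lapply(keywords, function(k) {
    str_detect(x, regex(k, ignore_case = TRUE))
  })
  # combine the per-keyword logical vectors with element-wise AND
  Reduce(`&`, hits)
}

result <- df %>%
  mutate(charge = case_when(
    has_all(charge, c("trespass", "1st")) ~ "Trespass 1st",
    has_all(charge, c("rape", "1st"))     ~ "Rape 1st",
    # rows matching neither keyword pair keep their original value
    TRUE ~ charge
  ))

print(result)
```

This does not depend on the exact spacing or wording between the keywords, at the cost of being less strict than the anchored patterns above.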
df <- structure(list(charge = c("trespass-1st degree", "trespass - 1st degree",
"rape or attempted rape - 1st degree", "rape or attempt rape 1st degree",
"Assault 1st", "Assault 1st")), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))
Upvotes: 1