Reputation: 22
I'm having difficulty reducing the number of factor levels in aggregated data. Long story short, I grouped a variety of car damage repair data to understand what maintenance has been done on a car. The issue is that each value now contains duplicate words if a certain aspect of the car has been worked on multiple times.
I'm trying to do this using str_replace and regular expressions. I have found a way to remove duplicates, but it only returns a vector rather than replacing each observation in my data frame.
Example data can be found below:
UNITNUMBER <- c(1,2,3,4,5,6,7,8,9,10)
MAINTENANCE_TYPE <- c("ELECTRIC BODY ELECTRIC", "ELECTRIC ACCESSORY BODY BODY", "ACCESSORY BODY ACCESSORY", "BODY ELECTRIC",
"ACCESSORY CHASSIS ELECTRIC CHASSIS", "ACCESSORY BODY ELECTRIC", "BODY CHASSIS CHASSIS BODY",
"ELECTRIC ACCESSORY ELECTRIC BODY BODY CHASSIS", "BODY","ELECTRIC ELECTRIC")
df <- data.frame(UNITNUMBER, MAINTENANCE_TYPE)
I'd like the final output to be as follows in alphabetical order (if possible):
MAINTENANCE_TYPE <- c("BODY ELECTRIC", "ACCESSORY BODY ELECTRIC", "ACCESSORY BODY", "BODY ELECTRIC",
"ACCESSORY CHASSIS ELECTRIC", "ACCESSORY BODY ELECTRIC", "BODY CHASSIS",
"ACCESSORY BODY CHASSIS ELECTRIC", "BODY","ELECTRIC")
Is this possible?
I've tried all sorts of str_replace functions using regex and have been hitting my head against the wall! Any help is appreciated.
Upvotes: 0
Views: 38
Reputation: 388862
You can use a regex here with gsub to find any repeated words and remove them.
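# Remove each word that appears again later in the string, then collapse the leftover spaces.
trimws(gsub("\\s+", " ", gsub("\\b(\\S+)\\b(?=.*\\b\\1\\b)", "", df$MAINTENANCE_TYPE, perl = TRUE)))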
# [1] "BODY ELECTRIC" "ELECTRIC ACCESSORY BODY" "BODY ACCESSORY"
# [4] "BODY ELECTRIC" "ACCESSORY ELECTRIC CHASSIS" "ACCESSORY BODY ELECTRIC"
# [7] "CHASSIS BODY" "ACCESSORY ELECTRIC BODY CHASSIS" "BODY"
#[10] "ELECTRIC"
Regex taken from here.
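To see why the regex result is not in alphabetical order, it may help to look at which words the lookahead actually matches. The sketch below uses a single string from the question and a pattern of the same shape as the one above; the word boundaries around the back-reference and the names x, pattern, removed and kept are my own additions for illustration.

```r
x <- "BODY CHASSIS CHASSIS BODY"

# A word is matched only if the same whole word occurs again later in the string.
pattern <- "\\b(\\S+)\\b(?=.*\\b\\1\\b)"

# The earlier occurrences are the ones that get matched (and removed).
removed <- regmatches(x, gregexpr(pattern, x, perl = TRUE))[[1]]
removed
# [1] "BODY"    "CHASSIS"

# What remains is the last occurrence of each word, in its original position.
kept <- trimws(gsub("\\s+", " ", gsub(pattern, "", x, perl = TRUE)))
kept
# [1] "CHASSIS BODY"
```

Because the last occurrence of each word is kept where it stands, the regex deduplicates but does not sort.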
A more standard approach is to split each string into words, keep the unique words, sort them, and paste them back together. This also gives the alphabetical order asked for.
sapply(strsplit(as.character(df$MAINTENANCE_TYPE), "\\s+"), function(x)
paste(sort(unique(x)), collapse = " "))
# [1] "BODY ELECTRIC" "ACCESSORY BODY ELECTRIC" "ACCESSORY BODY"
# [4] "BODY ELECTRIC" "ACCESSORY CHASSIS ELECTRIC" "ACCESSORY BODY ELECTRIC"
# [7] "BODY CHASSIS" "ACCESSORY BODY CHASSIS ELECTRIC" "BODY"
#[10] "ELECTRIC"
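Since the question mentions getting a vector back instead of an updated data frame, here is a minimal sketch of assigning the result to the column. It uses a three-row subset of the example data, and dedupe_words is a hypothetical helper name, not something from the original answer.

```r
# Small subset of the question's data; stringsAsFactors = FALSE keeps the column as character.
df <- data.frame(
  UNITNUMBER = 1:3,
  MAINTENANCE_TYPE = c("ELECTRIC BODY ELECTRIC", "BODY CHASSIS CHASSIS BODY", "BODY"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split one string into words, keep unique words, sort, re-join.
dedupe_words <- function(x) {
  words <- strsplit(trimws(x), "\\s+")[[1]]
  paste(sort(unique(words)), collapse = " ")
}

# Overwrite the column in place; vapply guarantees one string per row.
df$MAINTENANCE_TYPE <- vapply(
  as.character(df$MAINTENANCE_TYPE), dedupe_words, character(1),
  USE.NAMES = FALSE
)

df$MAINTENANCE_TYPE
# [1] "BODY ELECTRIC" "BODY CHASSIS"  "BODY"
```

If the column needs to be a factor afterwards, wrapping the result in factor() rebuilds the levels from the cleaned strings.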
Upvotes: 2