Trying to remove duplicate strings from a single observation to limit the number of factors

Question

I'm having difficulty reducing the number of factor levels in aggregated data. Long story short, I grouped a variety of car damage repair data to understand what maintenance has been done on each car. The issue is that an observation now contains duplicate strings if a certain aspect of the car has been worked on multiple times.

I'm trying to do this using str_replace and regular expressions. I have found a way to remove duplicates, but it only returns a single vector of unique words rather than cleaning each observation in my data frame.

Example data can be found below:

UNITNUMBER <- c(1,2,3,4,5,6,7,8,9,10)
MAINTENANCE_TYPE <- c("ELECTRIC BODY ELECTRIC", "ELECTRIC ACCESSORY BODY BODY", "ACCESSORY BODY ACCESSORY", "BODY ELECTRIC",
                      "ACCESSORY CHASSIS ELECTRIC CHASSIS", "ACCESSORY BODY ELECTRIC", "BODY CHASSIS CHASSIS BODY",
                      "ELECTRIC ACCESSORY ELECTRIC BODY BODY CHASSIS", "BODY","ELECTRIC ELECTRIC")

df <- data.frame(UNITNUMBER, MAINTENANCE_TYPE)

I'd like the final output to be as follows in alphabetical order (if possible):

MAINTENANCE_TYPE <- c("BODY ELECTRIC", "ACCESSORY BODY ELECTRIC", "ACCESSORY BODY", "BODY ELECTRIC",
                      "ACCESSORY CHASSIS ELECTRIC", "ACCESSORY BODY ELECTRIC", "BODY CHASSIS",
                      "ACCESSORY BODY CHASSIS ELECTRIC", "BODY","ELECTRIC")

Is this possible?

I've tried all sorts of str_replace functions using regex and have been hitting my head against the wall! Any help is appreciated.

Ronak Shah · Accepted Answer

You can use a regex with gsub to find any word that is repeated later in the same string and remove the earlier occurrence.

trimws(gsub("(\\b\\S+\\b)(?=.*\\1)", "", df$MAINTENANCE_TYPE, perl = TRUE))

# [1] "BODY ELECTRIC"  "ELECTRIC ACCESSORY  BODY"  "BODY ACCESSORY"                 
# [4] "BODY ELECTRIC" "ACCESSORY  ELECTRIC CHASSIS" "ACCESSORY BODY ELECTRIC"
# [7] "CHASSIS BODY"  "ACCESSORY ELECTRIC  BODY CHASSIS" "BODY"                        
#[10] "ELECTRIC"

Regex taken from here.
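
The regex output above keeps the words in their original order and leaves a doubled space wherever a word was removed from the middle of a string. A minimal sketch of tidying that up, assuming a second gsub pass is acceptable (the variable names `x`, `deduped`, and `squeezed` are illustrative, not from the original answer):

```r
# Three of the question's strings, enough to show the effect.
x <- c("ELECTRIC ACCESSORY BODY BODY",
       "ACCESSORY CHASSIS ELECTRIC CHASSIS",
       "BODY")

# Drop every word that occurs again later in the same string.
deduped <- gsub("(\\b\\S+\\b)(?=.*\\1)", "", x, perl = TRUE)

# Collapse runs of whitespace to one space, then trim the ends.
squeezed <- trimws(gsub("\\s+", " ", deduped))
print(squeezed)
```

Even after this cleanup the words are not sorted, so "BODY ACCESSORY" and "ACCESSORY BODY" would still count as different values. If the goal is fewer factor levels, the split-and-sort approach is likely the better fit.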

A standard approach would be to split the string on every word, get unique words and paste them together.

# Split each string on whitespace, keep the unique words, sort them, and paste them back together
sapply(strsplit(as.character(df$MAINTENANCE_TYPE), "\\s+"), function(x)
  paste(sort(unique(x)), collapse = " "))

# [1] "BODY ELECTRIC"  "ACCESSORY BODY ELECTRIC"   "ACCESSORY BODY"         
# [4] "BODY ELECTRIC"  "ACCESSORY CHASSIS ELECTRIC" "ACCESSORY BODY ELECTRIC"
# [7] "BODY CHASSIS" "ACCESSORY BODY CHASSIS ELECTRIC" "BODY"                 
#[10] "ELECTRIC"
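
Since the question mentions getting back a vector instead of updated observations, here is a sketch of writing the result back into the data frame column. The helper name `dedupe_words` is hypothetical, and converting the column to a factor is an assumption based on the question's goal of limiting factor levels:

```r
# Example data from the question.
UNITNUMBER <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
MAINTENANCE_TYPE <- c("ELECTRIC BODY ELECTRIC", "ELECTRIC ACCESSORY BODY BODY",
                      "ACCESSORY BODY ACCESSORY", "BODY ELECTRIC",
                      "ACCESSORY CHASSIS ELECTRIC CHASSIS", "ACCESSORY BODY ELECTRIC",
                      "BODY CHASSIS CHASSIS BODY",
                      "ELECTRIC ACCESSORY ELECTRIC BODY BODY CHASSIS",
                      "BODY", "ELECTRIC ELECTRIC")
df <- data.frame(UNITNUMBER, MAINTENANCE_TYPE)

# Hypothetical helper: unique, alphabetically sorted words per string.
dedupe_words <- function(x) {
  words <- strsplit(as.character(x), "\\s+")
  # vapply guarantees one character string per input element.
  vapply(words, function(w) paste(sort(unique(w)), collapse = " "),
         character(1))
}

# Overwrite the column in place, one cleaned string per row.
df$MAINTENANCE_TYPE <- factor(dedupe_words(df$MAINTENANCE_TYPE))
print(df)
```

Because the function returns one string per input element, the assignment keeps the rows aligned with UNITNUMBER. Sorting also makes strings with the same words in a different order collapse into the same level.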
