Remove non-unique string components from a column in R

Question

example <- data.frame(
  file_name = c("some_file_name_first_2020.csv", 
                "some_file_name_second_and_third_2020.csv",
                "some_file_name_4_2020_update.csv"),
  a = 1:3
)

example
#>                                  file_name a
#> 1            some_file_name_first_2020.csv 1
#> 2 some_file_name_second_and_third_2020.csv 2
#> 3         some_file_name_4_2020_update.csv 3

I have a data frame that looks something like this example. The "some_file_name" part changes often, the unique identifier is usually in the middle, and there is sometimes suffixed information that is important to retain.

I would like to end up with the data frame below. The approach I can think of is to find all common string "components" and remove them from each row.

desired
#>          file_name a
#> 1            first 1
#> 2 second_and_third 2
#> 3         4_update 3

Ronak Shah · Accepted Answer

This works for the example shared; perhaps you can use it to build a more general solution:

# Split each file name on "_" or "." (the dot must be escaped as "\\." in an R string)
list_data <- strsplit(example$file_name, '_|\\.')
# Get the words that occur only once across all file names
unique_words <- names(Filter(function(x) x == 1, table(unlist(list_data))))
# Keep only unique_words and paste the string back together
sapply(list_data, function(x) paste(x[x %in% unique_words], collapse = "_"))

#[1] "first"            "second_and_third" "4_update"

However, this answer relies on the filenames containing separators such as "_" that delimit each "component".
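
As one possible generalization, here is a minimal sketch that drops the components shared by every file name instead of keeping the words that occur exactly once. The two rules agree on the example, but they differ when a word appears in some rows and not all: the count-based code above would drop it, while this version keeps it. The function name `strip_common` and its `sep` argument are hypothetical names chosen for illustration, and the sketch still assumes the components are delimited by "_" or ".".

```r
# Hypothetical helper: remove the components that appear in every element of x.
# Assumes components are separated by "_" or "." (regex passed via `sep`).
strip_common <- function(x, sep = "_|\\.") {
  # as.character() guards against factor columns (the default in R < 4.0)
  parts <- strsplit(as.character(x), sep)
  # Components present in every file name
  common <- Reduce(intersect, parts)
  # Drop the common components and glue the rest back together
  vapply(parts,
         function(p) paste(p[!p %in% common], collapse = "_"),
         character(1))
}

file_name <- c("some_file_name_first_2020.csv",
               "some_file_name_second_and_third_2020.csv",
               "some_file_name_4_2020_update.csv")

strip_common(file_name)
#[1] "first"            "second_and_third" "4_update"
```

To update the data frame in place, you could assign the result back, for example `example$file_name <- strip_common(example$file_name)`.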
