R: Cleaning a string using a list of wanted substrings

Question

I have a data frame with a string column:

Clause <- c('Big Truck Pb Anomaly Low', 'Red Truck Fe Anomaly High', 'Blue Truck Pb Anomaly High & Old Truck Fe Anomaly Low')
Input <- data.frame(Clause)

I would like to clean each string by retaining only the substrings found in a cleaning list:

Keepers <- c('Anomaly', 'Low', 'High', '&', 'Pb', 'Fe', ' ', 'SomethingNotPresent')

The desired result is below.

Wanted <- c('Pb Anomaly Low', 'Fe Anomaly High', 'Pb Anomaly High & Fe Anomaly Low')
Result <- data.frame(Wanted)

Note: The 'Keepers' list will also contain items, such as 'SomethingNotPresent', that do not occur in any of the strings.

Ronak Shah · Accepted Answer

You can split each string into words and, for each row, keep only the words that appear in Keepers.

# Split on runs of whitespace; the backslash must be doubled inside an R string
sapply(strsplit(Input$Clause, '\\s+'), function(x) 
       paste0(x[x %in% Keepers], collapse = ' '))

#[1] "Pb Anomaly Low"                   "Fe Anomaly High"                 
#[3] "Pb Anomaly High & Fe Anomaly Low"
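For anyone who wants to reuse this, here is a minimal self-contained sketch of the same split-and-filter idea wrapped in a function and checked against the Wanted vector from the question. The names keep_words and Cleaned are hypothetical, chosen for illustration only. The as.character() call is an added assumption to cover R versions before 4.0, where data.frame() turns strings into factors by default and strsplit() would fail on them.

```r
# Example data from the question
Clause <- c('Big Truck Pb Anomaly Low',
            'Red Truck Fe Anomaly High',
            'Blue Truck Pb Anomaly High & Old Truck Fe Anomaly Low')
Input <- data.frame(Clause, stringsAsFactors = FALSE)
Keepers <- c('Anomaly', 'Low', 'High', '&', 'Pb', 'Fe', ' ', 'SomethingNotPresent')

# keep_words() is a hypothetical helper name, not a base R function
keep_words <- function(strings, keepers) {
  # as.character() guards against factor input (assumption: older R defaults)
  tokens <- strsplit(as.character(strings), '\\s+')
  # vapply() guarantees exactly one character string per input row
  vapply(tokens,
         function(x) paste(x[x %in% keepers], collapse = ' '),
         character(1))
}

# Store the cleaned strings in a new (hypothetically named) column
Input$Cleaned <- keep_words(Input$Clause, Keepers)

Wanted <- c('Pb Anomaly Low',
            'Fe Anomaly High',
            'Pb Anomaly High & Fe Anomaly Low')
identical(Input$Cleaned, Wanted)  # TRUE
```

Because the strings are split on whitespace and rejoined with a single space, the ' ' entry in Keepers is never matched, and entries that occur nowhere (such as 'SomethingNotPresent') are harmless. Note that this matches whole words only; a keeper that is only part of a word would not be retained.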
