Reputation: 680
I have a data frame with a string column:
Clause <- c('Big Truck Pb Anomaly Low', 'Red Truck Fe Anomaly High', 'Blue Truck Pb Anomaly High & Old Truck Fe Anomaly Low')
Input <- data.frame(Clause)
I would like to clean each string by retaining only the substrings found in a list of terms to keep:
Keepers <- c('Anomaly', 'Low', 'High', '&', 'Pb', 'Fe', ' ', 'SomethingNotPresent')
The desired result is below.
Wanted <- c('Pb Anomaly Low', 'Fe Anomaly High', 'Pb Anomaly High & Fe Anomaly Low')
Result <- data.frame(Wanted)
Note: The 'Keepers' list will also contain items, such as 'SomethingNotPresent', that do not occur in the data.
Upvotes: 1
Views: 23
Reputation: 388862
You can split each string into words and, for each row, keep only the words that appear in Keepers.
# Split on whitespace, keep the whitelisted words, and glue them back together
sapply(strsplit(Input$Clause, '\\s+'), function(x)
  paste(x[x %in% Keepers], collapse = ' '))
#[1] "Pb Anomaly Low" "Fe Anomaly High" "Pb Anomaly High & Fe Anomaly Low"
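The same split-and-filter idea can be wrapped in a small helper, sketched below. The name `keep_terms` and the `Cleaned` column are hypothetical, not part of the answer above. The `as.character()` call is an added assumption that guards against `Clause` being a factor, which was the `data.frame` default before R 4.0.

```r
# Hypothetical helper (illustrative only): keep the whitelisted words of each string.
keep_terms <- function(x, keepers) {
  # as.character() guards against factor columns (the data.frame default before R 4.0)
  words <- strsplit(as.character(x), "\\s+")
  # vapply() always returns a character vector, one element per input string
  vapply(words, function(w) paste(w[w %in% keepers], collapse = " "), character(1))
}

Clause <- c('Big Truck Pb Anomaly Low', 'Red Truck Fe Anomaly High',
            'Blue Truck Pb Anomaly High & Old Truck Fe Anomaly Low')
Keepers <- c('Anomaly', 'Low', 'High', '&', 'Pb', 'Fe', ' ', 'SomethingNotPresent')
Input <- data.frame(Clause)
Input$Cleaned <- keep_terms(Input$Clause, Keepers)
```

Entries in Keepers that never occur, such as 'SomethingNotPresent' or the lone space, are harmless here because `%in%` only tests the words that are actually present.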
Upvotes: 2
Reputation: 520968
You may form a regex alternation of whitelisted terms to keep. Then use a negative lookahead pattern to identify all terms/whitespace which should be removed:
alternation <- paste(Keepers, collapse="|")
regex <- paste0("\\s*(?!(?:", alternation, "))(?<!\\S)\\S+(?!\\S)\\s*")
df$clause <- gsub("\\s+", " ", trimws(gsub(regex, " ", df$clause, perl=TRUE)))
df
                            clause
1                   Pb Anomaly Low
2                  Fe Anomaly High
3 Pb Anomaly High & Fe Anomaly Low
Data:
inp <- c('Big Truck Pb Anomaly Low', 'Red Truck Fe Anomaly High',
         'Blue Truck Pb Anomaly High & Old Truck Fe Anomaly Low')
df <- data.frame(clause=inp, stringsAsFactors=FALSE)
Keepers <- c('Anomaly', 'Low', 'High', '&', 'Pb', 'Fe', ' ', 'SomethingNotPresent')
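If Keepers could contain regex metacharacters, or if only whole-word matches should be kept, a variant of the lookahead idea might look like the sketch below. The helper name `strip_unlisted` is hypothetical, and the whole-token rule is an assumption that goes beyond the question. Each term is wrapped in PCRE's `\Q...\E` literal quoting, which assumes no term itself contains `\E`.

```r
# Hypothetical helper (illustrative only): drop every whitespace-delimited
# token that is not exactly one of `keepers`.
strip_unlisted <- function(x, keepers) {
  # Whitespace-only entries such as ' ' can never equal a token, so skip them
  keepers <- keepers[grepl("\\S", keepers)]
  # \Q...\E makes PCRE treat each keeper literally (assumes no keeper contains "\E")
  alternation <- paste0("\\Q", keepers, "\\E", collapse = "|")
  # A token is removed when it starts after whitespace (or at the start of the
  # string) and is not a keeper immediately followed by whitespace or the end
  pattern <- paste0("(?<!\\S)(?!(?:", alternation, ")(?!\\S))\\S+")
  stripped <- gsub(pattern, "", x, perl = TRUE)
  # Collapse the whitespace left behind by the removed tokens
  trimws(gsub("\\s+", " ", stripped))
}

inp <- c('Big Truck Pb Anomaly Low', 'Red Truck Fe Anomaly High',
         'Blue Truck Pb Anomaly High & Old Truck Fe Anomaly Low')
Keepers <- c('Anomaly', 'Low', 'High', '&', 'Pb', 'Fe', ' ', 'SomethingNotPresent')
cleaned <- strip_unlisted(inp, Keepers)
```

With this pattern a token such as 'Lowrider' is removed even though it begins with the keeper 'Low', because the inner `(?!\S)` requires the keeper to end where the token ends.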
Upvotes: 0