Reputation: 1571
I have the following strings:
string <- c("Trade (% of GDP) – Latin America & Caribbean (WB/WDI/NE.TRD.GNFS.ZS-ZJ)", "Trade (% of GDP) – North America (WB/WDI/NE.TRD.GNFS.ZS-XU)" )
In each string, I want to remove everything except the "middle" part.
My expected result should look like this:
expected_string <- c("Latin America & Caribbean", "North America")
Can someone show me how to do this using gsub?
Upvotes: 0
Views: 451
Reputation: 26343
Another idea:
trimws(sub(".*–([^\\(]+).*", "\\1", string))
# [1] "Latin America & Caribbean" "North America"
This removes everything up to and including –, as well as the opening parenthesis ( and everything after it. We use a capture group to isolate the desired output; trimws removes leading and trailing whitespace.
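To make the role of the capture group and of trimws easier to see, the one-liner can be split into two steps. This is only a sketch of the same approach; the variable names captured and result are illustrative and not part of the original answer.

```r
string <- c(
  "Trade (% of GDP) – Latin America & Caribbean (WB/WDI/NE.TRD.GNFS.ZS-ZJ)",
  "Trade (% of GDP) – North America (WB/WDI/NE.TRD.GNFS.ZS-XU)"
)

# Step 1: keep only the capture group, i.e. the text between "–"
# and the next "(". The surrounding spaces are still included.
# (captured is an illustrative name.)
captured <- sub(".*–([^\\(]+).*", "\\1", string)
captured
# [1] " Latin America & Caribbean " " North America "

# Step 2: strip the leading and trailing whitespace.
result <- trimws(captured)
result
# [1] "Latin America & Caribbean" "North America"
```

Note that sub is enough here because the pattern spans the whole string, so there is only one match to replace.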
Upvotes: 1
Reputation: 69151
You can do this with a regular expression. Based on the two examples, I identified two patterns: 1) remove everything up to and including the – separator, and 2) remove everything within parentheses ().
Here's one solution to do that:
string <- c("Trade (% of GDP) – Latin America & Caribbean (WB/WDI/NE.TRD.GNFS.ZS-ZJ)", "Trade (% of GDP) – North America (WB/WDI/NE.TRD.GNFS.ZS-XU)" )
gsub("^.*\\s–\\s|\\s*\\([^\\)]+\\)", "", string)
#> [1] "Latin America & Caribbean" "North America"
Created on 2019-03-10 by the reprex package (v0.2.1)
The first part of the regex, ^.*\\s–\\s, says "grab all the characters from the start of the string through the –, including the whitespace character on either side of it".
In regex, | means OR, so the second part, \\s*\\([^\\)]+\\), matches any parenthesized text along with the whitespace in front of it. Because gsub replaces every match, both parts are removed in a single call. Credit to this question for that regex.
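As a rough illustration of what each side of the | contributes, the two alternatives can also be applied one after the other. This is a sketch under the assumption that the input always has the form shown in the question; without_prefix and result are illustrative names.

```r
string <- c(
  "Trade (% of GDP) – Latin America & Caribbean (WB/WDI/NE.TRD.GNFS.ZS-ZJ)",
  "Trade (% of GDP) – North America (WB/WDI/NE.TRD.GNFS.ZS-XU)"
)

# First alternative: drop everything from the start of the string
# through " – ". (without_prefix is an illustrative name.)
without_prefix <- sub("^.*\\s–\\s", "", string)
without_prefix
# [1] "Latin America & Caribbean (WB/WDI/NE.TRD.GNFS.ZS-ZJ)"
# [2] "North America (WB/WDI/NE.TRD.GNFS.ZS-XU)"

# Second alternative: drop the remaining parenthesized code together
# with the whitespace in front of it.
result <- sub("\\s*\\([^\\)]+\\)", "", without_prefix)
result
# [1] "Latin America & Caribbean" "North America"
```

The order matters in this two-step version: if the parenthesized text were removed first, sub would only strip the first pair of parentheses, (% of GDP), and the trailing code would be left in place.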
Upvotes: 1