Remove a string except words in specific position in R

Question

I have the following strings

string <- c("Trade (% of GDP) – Latin America & Caribbean (WB/WDI/NE.TRD.GNFS.ZS-ZJ)", "Trade (% of GDP) – North America (WB/WDI/NE.TRD.GNFS.ZS-XU)"  )

In these strings, I want to remove everything except the "middle" part.

My expected result should look like this:

expected_string <- c("Latin America & Caribbean", "North America")

Can someone show me how to do this using gsub?

Chase · Accepted Answer

You can do this with a regular expression. Based on the two examples, I identified two patterns: 1) remove everything up to and including the en dash (–), and 2) remove the text in parentheses () that follows the part you want to keep.

Here's one solution to do that:

string <- c("Trade (% of GDP) – Latin America & Caribbean (WB/WDI/NE.TRD.GNFS.ZS-ZJ)", "Trade (% of GDP) – North America (WB/WDI/NE.TRD.GNFS.ZS-XU)"  )
gsub("^.*\\s–\\s|\\s*\\([^\\)]+\\)", "", string)
#> [1] "Latin America & Caribbean" "North America"

Created on 2019-03-10 by the reprex package (v0.2.1)

The first part of the regex, ^.*\s–\s, says "match everything from the start of the string up to and including the en dash (–) and the whitespace on either side of it".

In regex, | means OR, so the second pattern, \s*\([^\)]+\), matches a parenthesized group together with any whitespace in front of it. Credit to this question for that regex.
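To see what each side of the | contributes, it may help to apply the two alternatives one at a time. The following is only a sketch of that idea, not part of the original answer: the names step1, step2 and combined are made up for illustration, and it assumes the separator in your data is always an en dash (–) surrounded by whitespace, as in the two examples. Note that inside an R string every regex backslash is doubled, so \s is written "\\s".

```r
# Same input as in the question. The separator is an en dash, not a hyphen.
string <- c(
  "Trade (% of GDP) – Latin America & Caribbean (WB/WDI/NE.TRD.GNFS.ZS-ZJ)",
  "Trade (% of GDP) – North America (WB/WDI/NE.TRD.GNFS.ZS-XU)"
)

# Step 1: first alternative only. Drops everything from the start of the
# string through the en dash and the whitespace after it.
step1 <- sub("^.*\\s–\\s", "", string)
print(step1)

# Step 2: second alternative only. Drops the parenthesized code that is
# left at the end, plus the whitespace in front of it.
step2 <- sub("\\s*\\([^\\)]+\\)", "", step1)
print(step2)

# The single gsub() call does both removals in one pass.
combined <- gsub("^.*\\s–\\s|\\s*\\([^\\)]+\\)", "", string)
print(identical(step2, combined))
```

In the combined call, gsub (rather than sub) is what allows both alternatives to be removed from the same string, since sub would stop after the first match.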
