msh855

Reputation: 1571

Remove a string except words in specific position in R

I have the following strings

string <- c("Trade (% of GDP) – Latin America & Caribbean (WB/WDI/NE.TRD.GNFS.ZS-ZJ)", "Trade (% of GDP) – North America (WB/WDI/NE.TRD.GNFS.ZS-XU)"  )

In each string, I want to remove everything except the "middle" phrase.

My expected result should look like this:

expected_string <- c("Latin America & Caribbean", "North America")

Can someone show me how to do this using gsub?

Upvotes: 0

Views: 451

Answers (2)

markus

Reputation: 26343

Another idea

trimws(sub(".*–([^\\(]+).*", "\\1", string))
# [1] "Latin America & Caribbean" "North America" 

This removes everything up to and including the dash –, as well as everything from the following opening bracket ( onward. We use a capture group to isolate the desired output. trimws removes leading and trailing whitespace.
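To make the role of the capture group and trimws more visible, here is a minimal sketch that runs the same pattern in two stages. The intermediate names captured and result are only illustrative and are not part of the original answer.

```r
string <- c(
  "Trade (% of GDP) – Latin America & Caribbean (WB/WDI/NE.TRD.GNFS.ZS-ZJ)",
  "Trade (% of GDP) – North America (WB/WDI/NE.TRD.GNFS.ZS-XU)"
)

# Stage 1: keep only the capture group, i.e. the run of non-"(" characters
# that follows the dash. The surrounding spaces are still attached here.
captured <- sub(".*–([^\\(]+).*", "\\1", string)
captured
# [1] " Latin America & Caribbean " " North America "

# Stage 2: strip the leading and trailing whitespace.
result <- trimws(captured)
result
# [1] "Latin America & Caribbean" "North America"
```

Since each string needs only one substitution, sub is enough here; gsub would give the same result.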

Upvotes: 1

Chase

Reputation: 69151

You can do this with a regular expression. Based on the two examples, the two patterns I identified were 1) remove everything up to and including the dash –, and 2) remove everything within parens ().

Here's one solution to do that:

string <- c("Trade (% of GDP) – Latin America & Caribbean (WB/WDI/NE.TRD.GNFS.ZS-ZJ)", "Trade (% of GDP) – North America (WB/WDI/NE.TRD.GNFS.ZS-XU)"  )
gsub("^.*\\s–\\s|\\s*\\([^\\)]+\\)", "", string)
#> [1] "Latin America & Caribbean" "North America"

Created on 2019-03-10 by the reprex package (v0.2.1)

The first part of the regex, ^.*\\s–\\s, says "match all the characters from the start of the string up to and including the – and the whitespace on either side of it".

In regex, the | means OR, so the second part, \\s*\\([^\\)]+\\), matches a parenthesized group together with any whitespace in front of it. Credit to this question for that regex.
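As a rough illustration of what each half of the alternation removes, the same pattern can be applied as two separate calls. This is only a sketch to show the intermediate state; step1 and step2 are made-up names, and the single gsub call above does the same job in one pass.

```r
string <- c(
  "Trade (% of GDP) – Latin America & Caribbean (WB/WDI/NE.TRD.GNFS.ZS-ZJ)",
  "Trade (% of GDP) – North America (WB/WDI/NE.TRD.GNFS.ZS-XU)"
)

# First alternative: drop everything from the start through " – ".
step1 <- sub("^.*\\s–\\s", "", string)
step1
# [1] "Latin America & Caribbean (WB/WDI/NE.TRD.GNFS.ZS-ZJ)"
# [2] "North America (WB/WDI/NE.TRD.GNFS.ZS-XU)"

# Second alternative: drop the remaining parenthesized code and the
# whitespace in front of it.
step2 <- gsub("\\s*\\([^\\)]+\\)", "", step1)
step2
# [1] "Latin America & Caribbean" "North America"
```

Note that the leading (% of GDP) never reaches the second pattern, because the greedy ^.* in the first pattern has already consumed it.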

Upvotes: 1
