NickA

Reputation: 433

How to keep certain parts of a string in R

I have a data set of individuals and their birth countries. Some of these people were born when Yugoslavia, the Austrian Empire, Prussia, etc. still existed, so for those rows the column value gives the historical state followed by the present-day country in brackets. How can I keep only the country in brackets so that I can later group my data by country?

Person   Birth Country
Nick     Prussia (Germany)
Mike     Germany
Maria    Canada
Mark     Russian Empire (Poland)
Sven     Germany
Jarek    Poland

Upvotes: 2

Views: 213

Answers (2)

Ronak Shah

Reputation: 388862

You can remove everything up to and including the opening bracket, plus the closing bracket, if they exist:

gsub('.*\\(|\\)', '', df$Birth_Country)
#[1] "Germany" "Germany" "Canada"  "Poland"  "Germany" "Poland"
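
A minimal, self-contained sketch of the same call, assuming the column values sit in a plain character vector (the name birth_country is hypothetical, chosen only for illustration). The trimws() step is an added precaution in case the raw values carry stray whitespace; it is not needed for the sample data:

```r
# Hypothetical vector standing in for the birth-country column
birth_country <- c("Prussia (Germany)", "Germany", "Canada",
                   "Russian Empire (Poland)", "Germany", "Poland")

# Remove everything up to and including "(", and any ")"
country <- gsub(".*\\(|\\)", "", birth_country)

# Strip leading/trailing whitespace, in case the raw data has any
country <- trimws(country)
print(country)
```

Values without brackets contain nothing for the pattern to match, so they pass through unchanged.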

Upvotes: 1

akrun

Reputation: 887008

We can use sub to capture, as a group, the characters that follow a ( and are not a ), and then refer to that group with the backreference (\\1) in the replacement:

df1$Country <- sub(".*\\(([^)]+)\\).*", "\\1", df1$`Birth Country`)
df1$Country
#[1] "Germany" "Germany" "Canada"  "Poland"  "Germany" "Poland" 

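The pattern matches .* (any characters, zero or more), followed by a literal ( (escaped as \\( because it is a metacharacter), then captures as a group ((...)) one or more characters that are not a ) ([^)]+), followed by a literal ) (\\)) and any remaining characters (.*). Because the whole string is matched, replacing it with \\1 leaves only the captured country.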
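
To see how the pattern behaves on the two kinds of input, here is a small sketch (the variable names are hypothetical). It relies on the fact that sub() returns its input unchanged when the pattern does not match:

```r
pattern <- ".*\\(([^)]+)\\).*"

# Bracketed value: the whole string matches and is replaced by group 1
with_brackets <- sub(pattern, "\\1", "Prussia (Germany)")

# No brackets: the pattern does not match, so the input comes back as-is
without_brackets <- sub(pattern, "\\1", "Canada")

print(c(with_brackets, without_brackets))
```

This is why rows such as "Germany" or "Canada" need no special handling.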

data

df1 <- structure(list(Person = c("Nick", "Mike", "Maria", "Mark", "Sven", 
"Jarek"), `Birth Country` = c("Prussia (Germany)", "Germany", 
"Canada", "Russian Empire (Poland)", "Germany", "Poland")),
class = "data.frame", row.names = c(NA, 
-6L))
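
One possible follow-up for the grouping step mentioned in the question, sketched with base R's table(). The data frame is rebuilt here only so that the block runs on its own, and counting people per country is an assumed example of "grouping", not something the question specifies:

```r
# Rebuild the example data so this block is self-contained
df1 <- data.frame(
  Person = c("Nick", "Mike", "Maria", "Mark", "Sven", "Jarek"),
  `Birth Country` = c("Prussia (Germany)", "Germany", "Canada",
                      "Russian Empire (Poland)", "Germany", "Poland"),
  check.names = FALSE  # keep the space in the column name
)

# Extract the present-day country as in the answer above
df1$Country <- sub(".*\\(([^)]+)\\).*", "\\1", df1$`Birth Country`)

# Count people per present-day country
counts <- table(df1$Country)
print(counts)
```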

Upvotes: 1
