Reputation: 433
I have a data set of individuals and their birth countries. Some of them were born when states such as Yugoslavia, the Austrian Empire, or Prussia still existed, so in those column values the present-day country is given in brackets. How can I keep only the country in the brackets so that I can later group my data by country?
Person  Birth Country
Nick    Prussia (Germany)
Mike    Germany
Maria   Canada
Mark    Russian Empire (Poland)
Sven    Germany
Jarek   Poland
Upvotes: 2
Views: 213
Reputation: 388862
You can remove everything up to and including the opening bracket, plus the closing bracket (if they exist):
gsub('.*\\(|\\)', '', df$Birth_Country)
#[1] "Germany" "Germany" "Canada" "Poland" "Germany" "Poland"
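To see why this works on values with and without brackets, here is a minimal self-contained sketch. It uses a plain character vector built from the question's sample data; the names birth_country and country are only illustrative.

```r
# Values from the question's sample data (vector name is illustrative).
birth_country <- c("Prussia (Germany)", "Germany", "Canada",
                   "Russian Empire (Poland)", "Germany", "Poland")

# The pattern has two alternatives separated by "|":
#   .*\\(   everything up to and including the opening bracket
#   \\)     the closing bracket
# gsub() deletes every match; strings without brackets have no match
# and are returned as they are.
country <- gsub(".*\\(|\\)", "", birth_country)
print(country)
```

Because gsub replaces all matches, both the leading part and the trailing bracket are removed in one call.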
Upvotes: 1
Reputation: 887008
We can use sub to extract the characters that are not ) after a ( as a capture group, and in the replacement specify the backreference (\\1) of the captured group
df1$Country <- sub(".*\\(([^)]+)\\).*", "\\1", df1$`Birth Country`)
df1$Country
#[1] "Germany" "Germany" "Canada" "Poland" "Germany" "Poland"
The pattern we are matching is .* (any characters) followed by a literal ( (escaped as \\( because it is a metacharacter), then a capture group ((...)) of one or more characters that are not a ) ([^)]+), followed by a literal ) (\\)) and any remaining characters (.*). Values without brackets do not match the pattern, so sub returns them unchanged.
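A small sketch of how the pattern behaves on one bracketed and one plain value. The variable names x, pattern, has_brackets and result are only illustrative.

```r
x <- c("Prussia (Germany)", "Germany")
pattern <- ".*\\(([^)]+)\\).*"

# Which elements contain a bracketed part at all?
has_brackets <- grepl(pattern, x)
print(has_brackets)

# Where the pattern matches, the whole string is replaced by capture group 1;
# where it does not match, sub() returns the element unchanged.
result <- sub(pattern, "\\1", x)
print(result)
```

This is why plain entries such as Germany or Canada need no special handling.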
df1 <- structure(list(Person = c("Nick", "Mike", "Maria", "Mark", "Sven",
"Jarek"), `Birth Country` = c("Prussia (Germany)", "Germany",
"Canada", "Russian Empire (Poland)", "Germany", "Poland")),
class = "data.frame", row.names = c(NA,
-6L))
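For the grouping step mentioned in the question, one possible sketch in base R is shown below. It rebuilds the sample data so that it runs on its own; using aggregate with length is just one option (table or dplyr would also work), and the name counts is illustrative.

```r
# Rebuild the sample data; check.names = FALSE keeps the space in the name.
df1 <- data.frame(
  Person = c("Nick", "Mike", "Maria", "Mark", "Sven", "Jarek"),
  `Birth Country` = c("Prussia (Germany)", "Germany", "Canada",
                      "Russian Empire (Poland)", "Germany", "Poland"),
  check.names = FALSE
)

# Keep only the bracketed country where one is present.
df1$Country <- sub(".*\\(([^)]+)\\).*", "\\1", df1$`Birth Country`)

# One row per country; the Person column holds the number of people.
counts <- aggregate(Person ~ Country, data = df1, FUN = length)
print(counts)
```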
Upvotes: 1