I have a dataframe with lines of text that look like the following:
ANTALYA (GB) ch. 1960
SHOOTIN WAR (USA) ch. 1998
LORD AT WAR (ARG) ch. 1980
The all-caps words are the name, followed by the location in parentheses, a color abbreviation, and the year. Names can be multiple words. I want to separate this single block of text into its components: name, location, color, year. I have been fighting with this for several days, and the best working solution I have is to put every word into its own column, but that only works if every name has the same number of words... For what I'm doing with the data, I can use it in this form, but it just doesn't look nice, you know?
library(dplyr)
library(tidyr)
sepdf <- df %>%
  separate(pedigree, into = c("Name1", "Name2", "Loc", "Col", "Year"),
           sep = " ", extra = "merge")
I tried keeping just the name by using "(" as the separator between two columns, but I don't think R likes that I'm trying to use a parenthesis as a delimiter...
Any suggestions would be much, much appreciated.
Reputation: 70266
For more complicated pattern matching like yours, you can use tidyr's extract function, which lets you create regex capture groups. Each group is enclosed in a pair of parentheses, ( ):
library(tidyr)
extract(df, pedigree, into = c("Name", "Loc", "Col", "Year"),
        regex = "^([A-Z ]+) \\((.*)\\) ([a-z]+\\.) (\\d+)$")
#          Name Loc Col Year
# 1     ANTALYA  GB ch. 1960
# 2 SHOOTIN WAR USA ch. 1998
# 3 LORD AT WAR ARG ch. 1980
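To try this end to end, here is a minimal reproducible sketch. It assumes the data frame has a single character column called pedigree (the name comes from the question's code) and rebuilds it from the three sample lines. The convert = TRUE argument is an optional addition, not part of the call above; it asks extract to type-convert the new columns so that Year is stored as an integer rather than text.

```r
library(tidyr)

# Sample data rebuilt from the three lines in the question;
# the column name `pedigree` is taken from the asker's code.
df <- data.frame(
  pedigree = c("ANTALYA (GB) ch. 1960",
               "SHOOTIN WAR (USA) ch. 1998",
               "LORD AT WAR (ARG) ch. 1980"),
  stringsAsFactors = FALSE
)

# convert = TRUE runs type conversion on the new columns,
# so Year comes back as integer instead of character.
out <- extract(df, pedigree,
               into = c("Name", "Loc", "Col", "Year"),
               regex = "^([A-Z ]+) \\((.*)\\) ([a-z]+\\.) (\\d+)$",
               convert = TRUE)

str(out)
```

If you would rather keep every column as text, drop convert = TRUE and the result matches the output shown above.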
The regex I used here is:
^              beginning of the string
([A-Z ]+)      first group contains multiple capital letters and spaces
 \\(           then there's a space and an opening parenthesis (escaped with \\)
(.*)           the second group is anything in the parentheses
\\)            followed by a closing parenthesis and a space
([a-z]+\\.)    third group contains lower case letters and a dot
 (\\d+)        then a space and the fourth group contains only numbers
$              end of the string
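If you want to check what each capture group picks up before running the pattern over the whole column, one option is to test it on a single string with base R. This is only a sketch for inspecting the pattern, not what extract does internally; the variable names x, parts and groups are made up for the illustration.

```r
pattern <- "^([A-Z ]+) \\((.*)\\) ([a-z]+\\.) (\\d+)$"
x <- "LORD AT WAR (ARG) ch. 1980"

# regexec() records the position of the full match and of each capture group;
# regmatches() turns those positions into substrings.
parts <- regmatches(x, regexec(pattern, x))[[1]]

# Element 1 is the whole match; elements 2 to 5 are the four groups.
groups <- setNames(parts[-1], c("Name", "Loc", "Col", "Year"))
print(groups)
```

A string that does not fit the pattern gives an empty character vector for parts, which is a quick way to spot rows that need a looser regex.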