Separating line of text into columns of a dataframe

Question

I have a dataframe with lines of text that look like the following:

         ANTALYA (GB) ch. 1960
    SHOOTIN WAR (USA) ch. 1998
    LORD AT WAR (ARG) ch. 1980

The all caps are names, then location in (), color abbreviation, year. Names can be multiple words. I want to separate this single block of text into each component: Name, location, color, year. I have been fighting with this for several days, and the best working solution I have is to just put every word into separate columns, but it only works if the names are all a certain length... For what I'm doing with the data, I can use it in this form but it just doesn't look nice, you know?

sepdf <- df %>% 
           separate(pedigree, into=c("Name1", "Name2", "Loc", "Col", "Year"), 
                    sep=" ", merge=TRUE)

I tried just keeping the name by using the "(" as a separator between 2 columns, but I don't think R likes that I'm trying to use a parenthesis as a delimiter...

Any suggestions would be much, much appreciated.

talat · Accepted Answer

For more complicated pattern matching like yours, you can use tidyr's extract function, which lets you define regex capture groups. Each group is enclosed in a pair of parentheses, ():

library(tidyr)
extract(df, pedigree, into = c("Name", "Loc", "Col", "Year"), 
           regex = "^([A-Z ]+) \\((.*)\\) ([a-z]+\\.) (\\d+)$")
         Name Loc Col Year
1     ANTALYA  GB ch. 1960
2 SHOOTIN WAR USA ch. 1998
3 LORD AT WAR ARG ch. 1980

The regex I used here is:

^ beginning of the string
([A-Z ]+) first group contains one or more capital letters and spaces
 \\( then a space and an opening parenthesis (escaped with \\)
(.*) the second group is anything inside the parentheses
\\)  followed by a closing parenthesis (also escaped) and a space
([a-z]+\\.) third group contains lower case letters and a dot
 (\\d+) then a space and the fourth group, which contains only digits
$ end of the string
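
To try this end to end, here is a minimal sketch that rebuilds the three sample rows from the question and runs the same extract call. The data frame construction and the result name res are my own assumptions for illustration (the question does not show how df was created). The backslashes are doubled because the pattern is written as an R string literal, and convert = TRUE is an optional extra that was not part of the original call.

```r
library(tidyr)

# Sample data rebuilt from the question; the column name `pedigree`
# is taken from the asker's separate() call.
df <- data.frame(
  pedigree = c("ANTALYA (GB) ch. 1960",
               "SHOOTIN WAR (USA) ch. 1998",
               "LORD AT WAR (ARG) ch. 1980"),
  stringsAsFactors = FALSE
)

# One capture group per output column. "\\(" and "\\)" match literal
# parentheses; unescaped parentheses delimit the groups.
res <- extract(df, pedigree,
               into = c("Name", "Loc", "Col", "Year"),
               regex = "^([A-Z ]+) \\((.*)\\) ([a-z]+\\.) (\\d+)$",
               convert = TRUE)  # optional: turns Year into an integer column

str(res)
```

Rows that do not match the pattern come back as NA in all four columns, so it is worth checking for NAs if the real data has entries in a different layout.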
