milk

Reputation: 123

Splitting a column in a data frame?

I've got this data frame with data from IMDb in it. One of the columns has the movie title with the year attached in parentheses. Looks like this:

The Shawshank Redemption (1994)

What I really want is to have the title and year in separate columns. I've tried a couple of different things (split, strsplit), but I've had no success. I try to split on the first parenthesis, but the two split functions don't seem to like non-character arguments. Anyone have any thoughts?

Upvotes: 2

Views: 1071

Answers (4)

Ananta

Reputation: 3711

tidyr solution

df %>% separate(col, c("name", "year"), "[()]")

Thanks to Avinash, I can take his regular expression and apply it with tidyr:

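library(tidyr)  # provides separate() and re-exports %>%

m <- c("The Shawshank Redemption (1994)", "The Shawshank (Redemption) (1994)", "Kung(fu) Pa (23) nda (2010)")
m2 <- data.frame(m)
# split only at the final "(year)" group
m2 %>% separate(m, c("name", "year"), "\\s*\\((?=\\d+\\)$)|\\)$")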

                        name year
1   The Shawshank Redemption 1994
2 The Shawshank (Redemption) 1994
3       Kung(fu) Pa (23) nda 2010

Upvotes: 2

FelixNNelson

Reputation: 171

Try the following code:

t(sapply(strsplit(c("The Shawshank Redemption (1994)"), '\\s*\\(|\\)'),rbind))

The above code will also work if you pass in the column of your data frame containing the titles instead of the literal string. If that column is a factor, wrap it in as.character() first.
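As a rough sketch of what passing in a whole column looks like (the `titles` vector and the second movie are made up for illustration), the same call returns one row per title:

```r
# Hypothetical sample titles; each has exactly one parenthesized group.
titles <- c("The Shawshank Redemption (1994)", "The Godfather (1972)")

# Split each title, then arrange the pieces as one row per title.
parts <- t(sapply(strsplit(titles, "\\s*\\(|\\)"), rbind))

name <- parts[, 1]
year <- as.integer(parts[, 2])
print(data.frame(name, year))
```

This assumes every title contains only the year in parentheses. A title with additional parentheses yields more than two pieces, so the result would no longer be a clean two-column matrix.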

Upvotes: 0

Avinash Raj

Reputation: 174696

If you want an exact split (i.e., splitting only on the brackets at the end of the string), you may try this.

x <- c("The Shawshank Redemption (1994)", "Kung(fu) Pa (23) nda (2010)")
strsplit(as.character(x), "\\s*\\((?=\\d+\\)$)|\\)$", perl=TRUE)
# [[1]]
# [1] "The Shawshank Redemption" "1994"                    

# [[2]]
# [1] "Kung(fu) Pa (23) nda" "2010"
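To get from that list to two actual columns, one possible approach is to stack the pieces with do.call(rbind, ...). This is only a sketch: the data frame `df` and its column name `title` are assumptions, not taken from the question.

```r
# Hypothetical data frame; the column name `title` is an assumption.
df <- data.frame(title = c("The Shawshank Redemption (1994)",
                           "Kung(fu) Pa (23) nda (2010)"),
                 stringsAsFactors = FALSE)

# Split only at the final "(year)" group, using the lookahead pattern above.
parts <- strsplit(df$title, "\\s*\\((?=\\d+\\)$)|\\)$", perl = TRUE)

# Each element holds two pieces (name, year); stack them into a matrix.
m <- do.call(rbind, parts)

df$name <- m[, 1]
df$year <- as.integer(m[, 2])
print(df[, c("name", "year")])
```

This relies on every row ending in a "(digits)" group; rows that don't match would come back as a single piece and rbind would recycle values, so it is worth checking lengths(parts) first on real data.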

Upvotes: 3

akrun

Reputation: 887048

strsplit works on character columns, so if the column is of class factor, we need to convert it to character first (as.character(..)). Here, I match zero or more spaces (\\s*) followed by an opening parenthesis (\\(), or (|) a closing parenthesis (\\)), and split on that.

strsplit(as.character(d1$v1), '\\s*\\(|\\)')[[1]]
#[1] "The Shawshank Redemption" "1994"         

Or we can place the parentheses inside [] so that we don't have to escape them with \\ (as commented by @Avinash Raj)

strsplit(as.character(d1$v1), '\\s*[()]')[[1]]

data

v1 <- 'The Shawshank Redemption (1994)'
d1 <- data.frame(v1)
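To illustrate the factor point, here is a small sketch that builds the column explicitly as a factor (an assumption about how the asker's data was read in) and compares the two calls:

```r
# Assumption: the title column is a factor, as older R versions created by default.
d1 <- data.frame(v1 = factor("The Shawshank Redemption (1994)"))

# strsplit() rejects a factor; capture the error message instead of stopping.
res <- tryCatch(strsplit(d1$v1, "\\s*[()]"),
                error = function(e) conditionMessage(e))
print(res)

# Converting to character first makes the same call work.
pieces <- strsplit(as.character(d1$v1), "\\s*[()]")[[1]]
print(pieces)
```

If the column is already character (the default for data.frame() since R 4.0.0), the as.character() call is harmless, so it is safe to leave in either way.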

Upvotes: 7
