Reputation: 123
I've got this data frame with data from IMDb in it. One of the columns has the movie title with the year attached in parentheses. Looks like this:
The Shawshank Redemption (1994)
What I really want is to have the title and year in separate columns. I've tried a couple of different things (split, strsplit), but I've had no success. I try to split on the opening parenthesis, but the two split functions don't seem to like non-character arguments. Anyone have any thoughts?
Upvotes: 2
Views: 1071
Reputation: 3711
A tidyr solution:
df %>% separate(col, c("name", "year"), "[()]")
Thanks to Avinash, I can take his regular expression and apply it in tidyr:
m <- c("The Shawshank Redemption (1994)", "The Shawshank (Redemption) (1994)", "Kung(fu) Pa (23) nda (2010)")
m2 <- data.frame(m)
m2 %>% separate(m, c("name", "year"), "\\s*\\((?=\\d+\\)$)|\\)$")
                        name year
1   The Shawshank Redemption 1994
2 The Shawshank (Redemption) 1994
3       Kung(fu) Pa (23) nda 2010
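As a possible refinement of the call above, here is a minimal sketch that assumes tidyr is installed and uses two of separate()'s documented arguments: extra = "drop", to discard any pieces beyond the two named columns without a warning, and convert = TRUE, to turn the year into an integer. The data frame is a shortened copy of the example above.

```r
library(tidyr)

m2 <- data.frame(
  m = c("The Shawshank Redemption (1994)", "Kung(fu) Pa (23) nda (2010)"),
  stringsAsFactors = FALSE
)

out <- separate(
  m2, m, c("name", "year"),
  sep = "\\s*\\((?=\\d+\\)$)|\\)$",
  extra = "drop",   # discard any pieces beyond the two named columns
  convert = TRUE    # run type conversion, so year becomes an integer
)
str(out)
```

Whether the conversion is wanted depends on how the year is used later; leaving convert at its default keeps both columns as character.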
Upvotes: 2
Reputation: 171
Try the following code:
t(sapply(strsplit(c("The Shawshank Redemption (1994)"), '\\s*\\(|\\)'),rbind))
The above code will also work if you pass in the column of your data frame that contains the titles, as long as it is a character vector.
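To illustrate passing in a column, here is a rough sketch. The data frame movies and its column title are hypothetical stand-ins for the IMDb data, and it assumes every title has the simple "Name (year)" shape, so each split yields exactly two pieces.

```r
# Hypothetical data frame standing in for the IMDb data
movies <- data.frame(
  title = c("The Shawshank Redemption (1994)", "The Godfather (1972)"),
  stringsAsFactors = FALSE
)

# Split on optional whitespace followed by "(", or on ")"
parts <- strsplit(as.character(movies$title), "\\s*\\(|\\)")

# Stack the per-row pieces into a two-column character matrix
mat <- do.call(rbind, parts)

# Write the pieces back as new columns
movies$name <- mat[, 1]
movies$year <- as.integer(mat[, 2])
movies
```

Titles that contain extra parentheses would produce more than two pieces here, so the rbind step would no longer line up.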
Upvotes: 0
Reputation: 174696
If you want an exact split (i.e., splitting only on the brackets at the very end of the string), you may try this.
x <- c("The Shawshank Redemption (1994)", "Kung(fu) Pa (23) nda (2010)")
strsplit(as.character(x), "\\s*\\((?=\\d+\\)$)|\\)$", perl = TRUE)
# [[1]]
# [1] "The Shawshank Redemption" "1994"
# [[2]]
# [1] "Kung(fu) Pa (23) nda" "2010"
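To see why the anchored pattern matters, this small sketch compares it with a simpler pattern on the title that has inner parentheses. The variable names simple and anchored are only illustrative.

```r
title <- "Kung(fu) Pa (23) nda (2010)"

# Simple pattern: splits at every parenthesis in the title
simple <- strsplit(title, "\\s*[()]")[[1]]

# Anchored pattern: splits only at the final "(digits)" group
anchored <- strsplit(title, "\\s*\\((?=\\d+\\)$)|\\)$", perl = TRUE)[[1]]

length(simple)   # more than two pieces
anchored         # "Kung(fu) Pa (23) nda" "2010"
```

The lookahead (?=\\d+\\)$) requires digits and a closing parenthesis at the end of the string, which is why perl = TRUE is needed.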
Upvotes: 3
Reputation: 887048
The strsplit function works on character columns. So, if the column is of factor class, we need to convert it to character class (as.character(..)). Here, I am matching zero or more spaces (\\s*) followed by an opening parenthesis (\\(), or (|) the closing parenthesis (\\)), to split on:
strsplit(as.character(d1$v1), '\\s*\\(|\\)')[[1]]
#[1] "The Shawshank Redemption" "1994"
Or we can place the parentheses inside [] so that we don't have to escape them with \\ (as commented by @Avinash Raj):
strsplit(as.character(d1$v1), '\\s*[()]')[[1]]
# data used in the examples above
v1 <- 'The Shawshank Redemption (1994)'
d1 <- data.frame(v1)
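The factor issue described above can be reproduced with a short sketch. It sets stringsAsFactors = TRUE explicitly, because recent versions of R no longer create factors by default in data.frame(); the names msg and pieces are only illustrative.

```r
# Build the column as a factor on purpose
d1 <- data.frame(v1 = "The Shawshank Redemption (1994)", stringsAsFactors = TRUE)

# strsplit() on a factor fails; capture the error message instead of stopping
msg <- tryCatch(
  strsplit(d1$v1, "\\s*[()]"),
  error = function(e) conditionMessage(e)
)
msg

# Converting to character first makes the split work
pieces <- strsplit(as.character(d1$v1), "\\s*[()]")[[1]]
pieces
```

Calling as.character() is harmless when the column is already character, so it can be left in either way.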
Upvotes: 7