Reputation: 95
I have a data frame column named 'movie_title' that holds movie names together with the release year. The column contains the following two kinds of titles:
title1='Toy Story (1995)'
title2='City of Lost Children, The (Cité des enfants perdus, La) (1995)'
I want to split this into two columns, title and release year. I was able to extract the year successfully using the following regex:
re.findall('[1-2][0-9]{3}', string)[0]
I need help writing another regex that extracts the title (excluding the year and its surrounding parentheses).
For example, title1 and title2 should end up as:
title1='Toy Story'
title2='City of Lost Children, The (Cité des enfants perdus, La)'
Upvotes: 3
Views: 711
Reputation: 25181
>>> titles = [
...     'Toy Story (1995)',
...     'City of Lost Children, The (Cité des enfants perdus, La) (1995)',
... ]
>>>
>>> import re
>>>
>>> for title in titles:
...     m = re.match(r'^(.*) \((19\d\d|20\d\d)\)$', title)
...     name, year = m.groups()
...     print(f'name: {repr(name)} year: {repr(year)}')
...
name: 'Toy Story' year: '1995'
name: 'City of Lost Children, The (Cité des enfants perdus, La)' year: '1995'
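The loop above assumes every title matches; if one does not, re.match returns None and m.groups() raises an AttributeError. A minimal sketch of a more defensive version follows. The helper name split_title and the choice to return None for a missing year are assumptions for illustration, not part of the original answer.

```python
import re

# Same pattern as in the session above, compiled once for reuse.
TITLE_YEAR = re.compile(r'^(.*) \((19\d\d|20\d\d)\)$')

def split_title(movie_title):
    """Return (name, year); year is None when no trailing '(yyyy)' is found.

    Hypothetical helper: the name and the None fallback are assumptions.
    """
    m = TITLE_YEAR.match(movie_title)
    if m is None:
        # No trailing year: hand back the title unchanged.
        return movie_title, None
    return m.group(1), m.group(2)

for title in ['Toy Story (1995)', 'Untitled Film']:
    print(split_title(title))
```

If the data is in pandas, a function like this could presumably be applied to the 'movie_title' column row by row, though that part is not shown here.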
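Explanation of ^(.*) \((19\d\d|20\d\d)\)$ (as broken down on regex101.com):
^ anchors the match at the start of the string.
(.*) greedily captures everything before the final year group; this is the title.
A space followed by \( matches the space and the literal '(' that open the year group.
(19\d\d|20\d\d) captures a four-digit year starting with 19 or 20.
\)$ matches the literal ')' at the end of the string.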
Upvotes: 1
Reputation: 363
To get the year without the surrounding parentheses, use a regex that means "find the first run of one or more digits followed by a closing parenthesis". The regex looks like this: '\d+(?=\))'
1.) \d matches a digit; the + after it means one or more of them.
2.) (?=...) is a lookahead meaning "followed by", and \) is the literal character ')'. So (?=\)) means "followed by a ')'".
3.) Put together, this matches a run of at least one digit that is followed by ')'.
INPUT: City of Lost Children, The (Cité des enfants perdus, La) (1995)
OUTPUT: 1995
To get the movie title, use a regex that means "get the first run of non-digits followed by '('". It looks like this: '\D*(?=\()'
1.) \D matches a non-digit; with * it means any number of non-digits.
2.) Again, (?=\() means "followed by '('".
3.) Put together, this matches any number of non-digits followed by '('.
INPUT: City of Lost Children, The (Cité des enfants perdus, La) (1995)
OUTPUT: City of Lost Children, The (Cité des enfants perdus, La)
Note: the regex for the title assumes there are no digits in the title. The match also keeps the space that precedes the final '(', so strip the result.
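The two lookahead patterns above can be tried with re.search, as in the rough sketch below. The function name split_with_lookaheads is made up for illustration, and the .strip() call is an addition to drop the space that the title pattern leaves before the final '('.

```python
import re

def split_with_lookaheads(movie_title):
    """Hypothetical helper applying the two lookahead regexes from this answer."""
    # First run of digits that is immediately followed by ')'.
    year = re.search(r'\d+(?=\))', movie_title).group()
    # First run of non-digits that is immediately followed by '('.
    # The match keeps the space before '(', so strip it (an added step).
    name = re.search(r'\D*(?=\()', movie_title).group().strip()
    return name, year

title = 'City of Lost Children, The (Cité des enfants perdus, La) (1995)'
print(split_with_lookaheads(title))
```

As the note says, this relies on the title containing no digits; a title such as 'Se7en (1995)' would come back as only a fragment of the name.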
Upvotes: 1
Reputation: 4411
This almost does the trick:
.(?:[^\((0-9)])+
You just need to get rid of the trailing ) that it doesn't capture. I will update this answer if I find anything better.
Another thought: if you are sure that the year appears at the end of every movie title, why not just strip the last bit off? That is, remove the (xxxx) from the end of every movie string you have.
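The "strip the last bit off" idea could look something like the sketch below. The function names are invented for illustration, and both variants assume the year is a four-digit group in parentheses at the very end of the string.

```python
import re

def strip_year(movie_title):
    """Remove a trailing '(dddd)' group and the whitespace around it, if present."""
    return re.sub(r'\s*\(\d{4}\)\s*$', '', movie_title)

def strip_year_by_slicing(movie_title):
    """Drop the last 7 characters, i.e. ' (dddd)'.

    Only safe when every title really ends with a space and a
    parenthesised four-digit year (an assumption, not checked here).
    """
    return movie_title[:-7]

for title in ['Toy Story (1995)',
              'City of Lost Children, The (Cité des enfants perdus, La) (1995)']:
    print(repr(strip_year(title)), repr(strip_year_by_slicing(title)))
```

The regex version leaves a title untouched when it has no trailing year, whereas the slicing version would silently cut off real characters, so the former is probably the safer default.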
Upvotes: 1