shashank kumar

Reputation: 95

Python regex to get everything up to an expression like "(year)"

I have a data frame column named 'movie_title' that holds movie names along with the release year. The column contains two kinds of titles:

title1='Toy Story (1995)'
title2='City of Lost Children, The (Cité des enfants perdus, La) (1995)'

I want to split this into two columns, title and release year. I was able to extract the years successfully using the following regex:

re.findall('[1-2][0-9]{3}', string)[0]

I need help writing another regex that extracts the titles (excluding the year and its brackets).

E.g. title1 and title2 should look like:

title1='Toy Story'
title2='City of Lost Children, The (Cité des enfants perdus, La)'

Upvotes: 3

Views: 711

Answers (3)

Messa

Reputation: 25181

>>> titles = [
...     'Toy Story (1995)',
...     'City of Lost Children, The (Cité des enfants perdus, La) (1995)',
... ]
>>>
>>> import re
>>>
>>> for title in titles:
...     m = re.match(r'^(.*) \((19\d\d|20\d\d)\)$', title)
...     name, year = m.groups()
...     print(f'name: {repr(name)} year: {repr(year)}')
...
name: 'Toy Story' year: '1995'
name: 'City of Lost Children, The (Cité des enfants perdus, La)' year: '1995'

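Explanation of ^(.*) \((19\d\d|20\d\d)\)$ (see regex101.com for an interactive breakdown):

^(.*) captures everything from the start of the string as group 1. It is greedy, so it gives characters back only as far as needed for the rest of the pattern to match.
A space followed by \( matches the literal ' (' that precedes the year.
(19\d\d|20\d\d) captures a four-digit year starting with 19 or 20 as group 2.
\)$ matches the closing ')' at the very end of the string.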
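The loop above raises an AttributeError if a title has no trailing "(year)", because re.match then returns None. Below is a minimal sketch of a helper that tolerates such rows. The names TITLE_YEAR and split_title and the 'Untitled Project' row are made up for illustration. In pandas, Series.str.extract should accept the same pattern, with the named groups becoming column names, but that is not shown or tested here.

```python
import re

# Same pattern as in the answer, compiled once. Group names are illustrative.
TITLE_YEAR = re.compile(r'^(?P<title>.*) \((?P<year>19\d\d|20\d\d)\)$')

def split_title(movie_title):
    """Return (title, year); year is None when no trailing '(YYYY)' is found."""
    m = TITLE_YEAR.match(movie_title)
    if m is None:
        # No year suffix: keep the title unchanged instead of raising.
        return movie_title, None
    return m.group('title'), int(m.group('year'))

titles = [
    'Toy Story (1995)',
    'City of Lost Children, The (Cité des enfants perdus, La) (1995)',
    'Untitled Project',  # hypothetical row without a year
]
for t in titles:
    print(split_title(t))
```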

Upvotes: 1

wade king

Reputation: 363

To get the year without the surrounding parentheses, use a regex that means "find the first run of one or more digits followed by a closing parenthesis". The regex looks like this: '\d+(?=\))'

1.) \d matches a digit; the + after it means one or more of them.

2.) (?=...) is a lookahead meaning "followed by", and \) is the literal character ')'. So (?=\)) means "followed by a ')'".

3.) Put together, it matches a run of one or more digits followed by ')'.

INPUT: City of Lost Children, The (Cité des enfants perdus, La) (1995)

OUTPUT: 1995

To get the movie title, use a regex that means "get the first run of non-digits followed by '('". It looks like this: '\D*(?=\()'

1.) \D matches a non-digit; with * it means zero or more non-digits.

2.) Again, (?=\() means "followed by '('".

3.) Put together, it matches a run of non-digits followed by '('.

INPUT: City of Lost Children, The (Cité des enfants perdus, La) (1995)

OUTPUT: City of Lost Children, The (Cité des enfants perdus, La)

Note: the match keeps the space before '(1995)', so strip trailing whitespace from the result. The title regex also assumes there are no digits in the title.
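As a rough check of the two lookahead patterns, here is a small sketch using re.search. The second input, '2001: A Space Odyssey (1968)', is a hypothetical title added only to show what happens when the title itself contains digits.

```python
import re

s = 'City of Lost Children, The (Cité des enfants perdus, La) (1995)'

# One or more digits that are immediately followed by ')'.
year = re.search(r'\d+(?=\))', s).group()

# Non-digits followed by '('. The raw match ends with a space, hence strip().
title = re.search(r'\D*(?=\()', s).group().strip()

print(repr(year))   # '1995'
print(repr(title))  # 'City of Lost Children, The (Cité des enfants perdus, La)'

# Caveat: digits inside the title break the \D* pattern.
tricky = '2001: A Space Odyssey (1968)'  # hypothetical example
bad_title = re.search(r'\D*(?=\()', tricky).group().strip()
print(repr(bad_title))  # ': A Space Odyssey' (the leading '2001' is lost)
```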

Upvotes: 1

peachykeen

Reputation: 4411

This almost does the trick:

.(?:[^\((0-9)])+

You just need to get rid of the trailing ) that it doesn't capture. Will update this answer if I find anything better.

Another thought: if you are sure that the year appears at the end of every movie title, why not just strip the last bit off? That is, remove the trailing (xxxx) from every title string you have.
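A minimal sketch of that idea, in two flavours. The function names are made up for illustration. The first removes the suffix only when a four-digit group in parentheses really is at the end; the second slices blindly and is only safe if every title ends in exactly "(xxxx)".

```python
import re

def strip_year_suffix(movie_title):
    # Remove a trailing '(YYYY)' plus any whitespace before it, if present.
    return re.sub(r'\s*\(\d{4}\)$', '', movie_title)

def strip_last_six(movie_title):
    # Blind slicing: assumes the last six characters are always '(YYYY)'.
    return movie_title[:-6].rstrip()

titles = [
    'Toy Story (1995)',
    'City of Lost Children, The (Cité des enfants perdus, La) (1995)',
]
for t in titles:
    print(repr(strip_year_suffix(t)), repr(strip_last_six(t)))
```

The regex version leaves a title without a year untouched, while the slicing version would cut off the last six characters of such a title.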

Upvotes: 1
