Reputation: 1549
I have the following pandas dataframe:
import pandas as pd

pandas_dataframe = pd.DataFrame({
    'movie': [
        'Discreet Charm of the Bourgeoisie, The (Charme discret de la bourgeoisie, Le)',
        'Attack Force Z (a.k.a. The Z Men) (Z-tzu te kung tui)',
        'State of Things, The (Stand der Dinge, Der)',
        'Happy Tour, A',
        'Awfully Big Adventure, An',
        'American President, The',
    ],
    'genre': ['Action', 'Comedy', 'Drama', 'Children', 'Action', 'Documentary'],
})
pandas_dataframe
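I want to apply two transformations: remove the text in parentheses, and move a trailing article ("The", "A", "An") to the front of the title.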
My final dataframe should look like this:
+---------------------------------------+-------------+
| movie                                 | genre       |
+---------------------------------------+-------------+
| The Discreet Charm of the Bourgeoisie | Action      |
| Attack Force Z                        | Comedy      |
| The State of Things                   | Drama       |
| A Happy Tour                          | Children    |
| An Awfully Big Adventure              | Action      |
| The American President                | Documentary |
+---------------------------------------+-------------+
I know that the first transformation calls for a regular expression. However, when I try the following,
exp = r'\([^]*\)'
pandas_dataframe['movie'] = pandas_dataframe['movie'].apply(lambda x: re.sub(exp,"",x).strip())
I get this error: error: unterminated character set at position 2
In my latest edit, I added more movies whose trailing "A" or "An" also needs to move to the front. I apologize for not including them in the first place.
Upvotes: 1
Views: 93
Reputation: 402813
Original requirements:
This moves "The" to its correct position and removes stuff within parentheses in a single expression:
df['movie'].str.replace(r'(.*?),?\s*(The)?\s*\(.*\)\s*', r'\2 \1', regex=True)
0 The Discreet Charm of the Bourgeoisie
1 Attack Force Z
2 The State of Things
Name: movie, dtype: object
The Regex
(.*?) # The actual movie title - first capture group
,? # Optional comma (preceding "The")
\s* # Whitespace
(The)? # Optional "The" - second capture group
\s*
\(.*\) # Stuff within parentheses we don't need
\s*
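The breakdown above can also be written directly into the pattern with `re.VERBOSE`. This is only a sketch using the standard `re` module on the three original titles; the names `TITLE_RE`, `titles` and `cleaned` are made up for illustration. The extra `strip()` is an addition: when "The" is absent, group 2 is empty and the replacement `\2 \1` leaves a leading space.

```python
import re

# Same pattern as above; re.VERBOSE lets the comments live inside the regex.
TITLE_RE = re.compile(r"""
    (.*?)      # the actual movie title (group 1)
    ,?         # optional comma preceding "The"
    \s*        # whitespace
    (The)?     # optional "The" (group 2)
    \s*
    \(.*\)     # text within parentheses that we drop
    \s*
""", re.VERBOSE)

titles = [
    'Discreet Charm of the Bourgeoisie, The (Charme discret de la bourgeoisie, Le)',
    'Attack Force Z (a.k.a. The Z Men) (Z-tzu te kung tui)',
    'State of Things, The (Stand der Dinge, Der)',
]

# strip() removes the leading space left behind when group 2 did not match.
cleaned = [TITLE_RE.sub(r'\2 \1', t).strip() for t in titles]
print(cleaned)
```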
Updated requirements:
To support additional articles, let's do:
df['movie'].str.replace(r'(.*?),?\s*(The|A|An)?(?=\s*\(.*\)\s*|$).*', r'\2 \1', regex=True)
0 The Discreet Charm of the Bourgeoisie
1 Attack Force Z
2 The State of Things
3 A Happy Tour
4 An Awfully Big Adventure
5 The American President
Name: movie, dtype: object
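One thing the printed Series hides is stray whitespace: titles without an article get a leading space from `\2 \1`, and because the pattern can also match the empty string at the end of the input, a trailing space may be appended. A minimal sketch of a whitespace-safe variant, using only the standard `re` module (the names `ARTICLE_RE` and `fix_title` are hypothetical helpers, not part of pandas):

```python
import re

# Same pattern as the updated expression above.
ARTICLE_RE = re.compile(r'(.*?),?\s*(The|A|An)?(?=\s*\(.*\)\s*|$).*')

def fix_title(title: str) -> str:
    """Move a trailing article to the front and drop parenthesised text."""
    # count=1 replaces only the first match, skipping the empty match at the
    # end of the string; strip() drops the space left when no article is found.
    return ARTICLE_RE.sub(r'\2 \1', title, count=1).strip()

titles = [
    'Discreet Charm of the Bourgeoisie, The (Charme discret de la bourgeoisie, Le)',
    'Attack Force Z (a.k.a. The Z Men) (Z-tzu te kung tui)',
    'State of Things, The (Stand der Dinge, Der)',
    'Happy Tour, A',
    'Awfully Big Adventure, An',
    'American President, The',
]

for t in titles:
    print(repr(fix_title(t)))
```

On the dataframe this could be applied with `df['movie'].apply(fix_title)`; chaining `.str.strip()` after the `str.replace` call should have the same effect.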
@Wiktor might have a shorter method to do this.
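As a side note on the error in the question: in Python's `re` syntax, a `]` placed directly after `[^` is treated as a literal character, so `[^]*\)` never closes the set, hence "unterminated character set at position 2". Assuming the intent was "anything except a closing parenthesis", the set probably should have been `[^)]`. A small sketch (the variable names are illustrative):

```python
import re

bad = r'\([^]*\)'    # "]" right after "[^" is literal, so the set never closes
good = r'\([^)]*\)'  # "[^)]*" matches any run of characters other than ")"

try:
    re.compile(bad)
    compiled_ok = True
except re.error as exc:
    compiled_ok = False
    print('bad pattern:', exc)

title = 'Attack Force Z (a.k.a. The Z Men) (Z-tzu te kung tui)'
# Each parenthesised group is removed separately; strip() trims what is left.
cleaned = re.sub(good, '', title).strip()
print(cleaned)
```

This only covers the first transformation (removing the parentheses); the article still has to be moved separately.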
Upvotes: 3