NikSp

Reputation: 1549

Remove text between parentheses and change the position of words inside a string

I have the following pandas dataframe:

pandas_dataframe = pd.DataFrame({'movie': ['Discreet Charm of the Bourgeoisie, The (Charme discret de la bourgeoisie, Le)',
                                           'Attack Force Z (a.k.a. The Z Men) (Z-tzu te kung tui)',
                                           'State of Things, The (Stand der Dinge, Der)',
                                           'Happy Tour, A',
                                           'Awfully Big Adventure, An',
                                           'American President, The'],
                                 'genre': ['Action', 'Comedy', 'Drama', 'Children', 'Action', 'Documentary']})
pandas_dataframe

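I want to apply two transformations:

  • remove the text in parentheses
  • move a trailing article ("The", "A", "An") to the start of the title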

My final dataframe should look like this:

+---------------------------------------+------------+
| movie                                 | genre      |
+---------------------------------------+------------+
| The Discreet Charm of the Bourgeoisie | Action     |
| Attack Force Z                        | Comedy     |
| The State of Things                   | Drama      |
| A Happy Tour                          | Children   |
| An Awfully Big Adventure              | Action     |
| The American President                | Documentary|
+---------------------------------------+------------+

I know that the first transformation calls for a regular expression. However, when I try the following,

exp = r'\([^]*\)'
pandas_dataframe['movie'] = pandas_dataframe['movie'].apply(lambda x: re.sub(exp,"",x).strip())

I get this error: error: unterminated character set at position 2

In my latest edit, I added more examples of titles where "A" or "An" needs to be moved. I apologize for not including them in the first place.

Upvotes: 1

Views: 93

Answers (1)

cs95

Reputation: 402813

Original requirements:

  • move "The" to the start of the sentence
  • remove text in parentheses

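This moves "The" to its correct position and removes the text within parentheses in a single expression. The trailing .str.strip() drops the leading space that is left behind when a title has no "The" to move:

# regex=True is needed on pandas >= 2.0, where str.replace matches literally by default
df['movie'].str.replace(r'(.*?),?\s*(The)?\s*\(.*\)\s*', r'\2 \1', regex=True).str.strip()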

0    The Discreet Charm of the Bourgeoisie
1                           Attack Force Z
2                      The State of Things
Name: movie, dtype: object

The Regex

(.*?)   # The actual movie title - first capture group
,?      # Optional comma (preceding "The")
\s*     # Optional whitespace
(The)?  # Optional "The" - second capture group
\s*     # Optional whitespace
\(.*\)  # Stuff within parentheses we don't need
\s*     # Trailing whitespace
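
As a quick way to check the breakdown above, the annotated pattern can be compiled as written with re.VERBOSE and run through plain re.sub, without pandas. This is only a sketch: the names TITLE_RE and move_the are made up for illustration, and the sample titles are the ones from the question.

```python
import re

# The annotated pattern, compiled with re.VERBOSE so that whitespace and
# "#" comments inside the pattern are ignored by the regex engine.
TITLE_RE = re.compile(r"""
    (.*?)    # the actual movie title, first capture group
    ,?       # optional comma preceding the article
    \s*      # optional whitespace
    (The)?   # optional article, second capture group
    \s*      # optional whitespace
    \(.*\)   # parenthesised text to discard, greedy up to the last closing paren
    \s*      # trailing whitespace
""", re.VERBOSE)


def move_the(title):
    """Hypothetical helper: apply the pattern to a single title."""
    # An unmatched second group substitutes as an empty string (Python 3.5+),
    # which leaves a leading space, hence the strip().
    return TITLE_RE.sub(r"\2 \1", title).strip()


titles = [
    "Discreet Charm of the Bourgeoisie, The (Charme discret de la bourgeoisie, Le)",
    "Attack Force Z (a.k.a. The Z Men) (Z-tzu te kung tui)",
    "State of Things, The (Stand der Dinge, Der)",
]
cleaned = [move_the(t) for t in titles]
print(cleaned)
```

Titles without any parentheses do not match this pattern at all and come back unchanged, which is why the updated requirements need a different expression.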

Updated requirements:

  • move "A", "An", "The" to the start of the sentence
  • remove text in parentheses if present

To support the additional articles, extend the pattern:

df['movie'].str.replace(r'(.*?),?\s*(The|A|An)?(?=\s*\(.*\)\s*|$).*', r'\2 \1', regex=True).str.strip()

0    The Discreet Charm of the Bourgeoisie
1                           Attack Force Z
2                      The State of Things
3                             A Happy Tour
4                 An Awfully Big Adventure
5                   The American President
Name: movie, dtype: object
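
If the single expression is hard to follow, the same result can probably be reached in two simpler steps: first delete every parenthesised group, then move a trailing article to the front. This is a different technique from the one-liner above, shown only as a sketch. The names PARENS_RE, ARTICLE_RE and normalize_title are hypothetical, and it assumes titles always have the form "Title, Article" with no nested parentheses.

```python
import re

# Step 1: any "(...)" group together with the whitespace in front of it.
PARENS_RE = re.compile(r"\s*\([^)]*\)")
# Step 2: a title that ends in ", The", ", An" or ", A".
ARTICLE_RE = re.compile(r"^(.*),\s*(The|An|A)$")


def normalize_title(title):
    """Hypothetical helper: drop parenthesised text, then move the article."""
    without_parens = PARENS_RE.sub("", title).strip()
    # Titles without a trailing article do not match and are returned as-is.
    return ARTICLE_RE.sub(r"\2 \1", without_parens)


titles = [
    "Discreet Charm of the Bourgeoisie, The (Charme discret de la bourgeoisie, Le)",
    "Attack Force Z (a.k.a. The Z Men) (Z-tzu te kung tui)",
    "State of Things, The (Stand der Dinge, Der)",
    "Happy Tour, A",
    "Awfully Big Adventure, An",
    "American President, The",
]
normalized = [normalize_title(t) for t in titles]
print(normalized)
```

On a DataFrame this could be applied with df['movie'].map(normalize_title), assuming the column is named as in the question.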

@Wiktor might have a shorter method to do this.
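
As for the error in the question: in Python's re module a "]" that comes directly after "[^" is treated as a literal member of the set, so the set in \([^]*\) is never closed. The intended pattern was most likely \([^)]*\), that is, "anything except a closing parenthesis". A minimal sketch of the difference (the variable names are only for illustration):

```python
import re

broken = r'\([^]*\)'   # "]" right after "[^" is a literal, so the set never ends
fixed = r'\([^)]*\)'   # the set excludes ")" and is closed by the "]" that follows

error_raised = False
try:
    re.compile(broken)
except re.error as exc:
    # This is the "unterminated character set" failure from the question.
    error_raised = True
    print("broken pattern:", exc)

title = "State of Things, The (Stand der Dinge, Der)"
stripped = re.sub(fixed, "", title).strip()
print(stripped)
```

The fixed pattern only removes the parenthesised text, so the article still has to be moved in a separate step.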

Upvotes: 3
