Reputation: 4660

How to extract specific content in a pandas dataframe with a regex?

Consider the following pandas dataframe:

In [114]:

df['movie_title'].head()


Out[114]:

0     Toy Story (1995)
1     GoldenEye (1995)
2    Four Rooms (1995)
3    Get Shorty (1995)
4       Copycat (1995)
...
Name: movie_title, dtype: object

Update: I would like to extract just the movie titles with a regular expression. I chose the regex \b([^\d\W]+)\b and tried the following:

df_3['movie_title'] = df_3['movie_title'].str.extract('\b([^\d\W]+)\b')
df_3['movie_title']

However, I get the following:

0       NaN
1       NaN
2       NaN
3       NaN
4       NaN
5       NaN
6       NaN
7       NaN
8       NaN

Any idea how to extract specific features from text in a pandas dataframe? More specifically, how can I extract just the titles of the movies into a completely new dataframe? For instance, the desired output should be:

Out[114]:

0     Toy Story
1     GoldenEye
2    Four Rooms
3    Get Shorty
4       Copycat
...
Name: movie_title, dtype: object

Upvotes: 29

Answers (4)

Joselin Ceron

Reputation: 502

I wanted to extract the text after the "@" symbol and before the "." (period). I tried the following, which more or less worked, except that the result still includes the "@" symbol, which I do not want:

df['col'].astype(str).str.extract('(@.+.+)
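One way to drop the "@" from the result is to keep it outside the capture group. The following is a minimal sketch, assuming the column holds e-mail-like strings; the sample values and the `domain` column name are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; the column name 'col' follows the snippet above.
df = pd.DataFrame({'col': ['user@example.com', 'admin@mail.example.org', 'no-at-sign']})

# "@" sits outside the parentheses, so it is matched but not returned.
# [^.]+ captures everything up to (but not including) the first period.
df['domain'] = df['col'].astype(str).str.extract(r'@([^.]+)\.', expand=False)

print(df)
```

Rows that contain no "@" followed by a period come back as NaN.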

Upvotes: 1

Gqndhi

Reputation: 1

Use regular expressions to find a year stored between parentheses. We specify the parentheses so that we don't conflict with movies that have years in their titles:

movies_df['year'] = movies_df.title.str.extract(r'(\(\d\d\d\d\))', expand=False)

Removing the parentheses:

movies_df['year'] = movies_df.year.str.extract(r'(\d\d\d\d)', expand=False)

Removing the years from the 'title' column:

movies_df['title'] = movies_df.title.str.replace(r'(\(\d\d\d\d\))', '', regex=True)

Applying the strip function to get rid of any ending whitespace characters that may have appeared:

movies_df['title'] = movies_df['title'].str.strip()
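The same steps can probably be collapsed into a single pass with named capture groups. This is only a sketch, assuming every title ends with a four-digit year in parentheses; the sample titles are illustrative.

```python
import pandas as pd

# Illustrative data, including titles that contain digits themselves.
movies_df = pd.DataFrame({'title': ['Toy Story (1995)',
                                    'Blade Runner 2049 (2017)',
                                    '2001: A Space Odyssey (1968)']})

# Named groups become the column names of the DataFrame returned by extract.
# The pattern is anchored to the end, so only the trailing "(YYYY)" is the year.
parts = movies_df['title'].str.extract(r'^(?P<title>.*?)\s*\((?P<year>\d{4})\)\s*$')

movies_df['title'] = parts['title']
movies_df['year'] = parts['year']

print(movies_df)
```

Rows that do not match the pattern would end up as NaN in both columns, so this is worth checking before overwriting the original column.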

Upvotes: -1

su79eu7k

Reputation: 7326

To capture a specific part of the text, wrap that part in a capture group with (), as shown below.

new_df['just_movie_titles'] = df['movie_title'].str.extract(r'(.+?) \(', expand=False)
new_df['just_movie_titles']

pandas.core.strings.StringMethods.extract

StringMethods.extract(pat, flags=0, **kwargs)

Find groups in each string using passed regular expression
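A likely reason the attempt in the question returns only NaN is that the pattern was not written as a raw string: in an ordinary Python string literal, '\b' is the backspace character, not a regex word boundary. The sketch below, on two sample rows from the question, compares the two spellings and the capture-group pattern from this answer.

```python
import pandas as pd

s = pd.Series(['Toy Story (1995)', 'GoldenEye (1995)'], name='movie_title')

# Without the r prefix, '\b' is a backspace character (\x08), so nothing matches.
# The backslashes before d and W are doubled only to avoid invalid-escape
# warnings; the resulting pattern is the one used in the question.
not_raw = s.str.extract('\b([^\\d\\W]+)\b', expand=False)

# With a raw string, \b is a word boundary, but the group captures one word only.
raw = s.str.extract(r'\b([^\d\W]+)\b', expand=False)

# Capturing everything before " (" keeps multi-word titles intact.
titles = s.str.extract(r'(.+?) \(', expand=False)

print(not_raw.tolist())  # [nan, nan]
print(raw.tolist())      # ['Toy', 'GoldenEye']
print(titles.tolist())   # ['Toy Story', 'GoldenEye']
```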

Upvotes: 10

jezrael

Reputation: 863801

You can use str.extract followed by strip, but str.split is better, because movie names can contain numbers too. Another solution is to replace the content of the parentheses with a regex and then strip leading and trailing whitespace:

# convert the column to string
df['movie_title'] = df['movie_title'].astype(str)

# note: this also removes numbers from the movie names
df['titles'] = df['movie_title'].str.extract(r'([a-zA-Z ]+)', expand=False).str.strip()
df['titles1'] = df['movie_title'].str.split('(', n=1).str[0].str.strip()
df['titles2'] = df['movie_title'].str.replace(r'\([^)]*\)', '', regex=True).str.strip()
print(df)
          movie_title      titles      titles1      titles2
0  Toy Story 2 (1995)   Toy Story  Toy Story 2  Toy Story 2
1    GoldenEye (1995)   GoldenEye    GoldenEye    GoldenEye
2   Four Rooms (1995)  Four Rooms   Four Rooms   Four Rooms
3   Get Shorty (1995)  Get Shorty   Get Shorty   Get Shorty
4      Copycat (1995)     Copycat      Copycat      Copycat
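If some titles contain parentheses of their own, splitting on the first "(" would cut them short. A possible variant, sketched here on made-up data, removes only a trailing four-digit year and then puts the result into a new dataframe, as the question asks. The title "Some Film (Alternate Name)" is hypothetical.

```python
import pandas as pd

titles = pd.Series([
    'Toy Story 2 (1995)',
    'Some Film (Alternate Name) (1995)',  # hypothetical title with inner parentheses
    'GoldenEye (1995)',
])

# Splitting on the first "(" also drops the alternate name.
by_split = titles.str.split('(', n=1).str[0].str.strip()

# Removing only a "(YYYY)" at the end of the string keeps it.
cleaned = titles.str.replace(r'\s*\(\d{4}\)\s*$', '', regex=True)

# Build a completely new dataframe from the cleaned titles.
new_df = cleaned.to_frame(name='movie_title')

print(by_split.tolist())  # ['Toy Story 2', 'Some Film', 'GoldenEye']
print(new_df)
```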

Upvotes: 61
