Reputation: 4630
Consider the following pandas dataframe:
In [114]:
df['movie_title'].head()
Out[114]:
0 Toy Story (1995)
1 GoldenEye (1995)
2 Four Rooms (1995)
3 Get Shorty (1995)
4 Copycat (1995)
...
Name: movie_title, dtype: object
Update:
I would like to extract just the titles of the movies with a regular expression. So, let's use the following regex: \b([^\d\W]+)\b. So I tried the following:
df_3['movie_title'] = df_3['movie_title'].str.extract('\b([^\d\W]+)\b')
df_3['movie_title']
However, I get the following:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
Any idea how to extract specific features from text in a pandas dataframe? More specifically, how can I extract just the titles of the movies into a completely new dataframe? For instance, the desired output should be:
Out[114]:
0 Toy Story
1 GoldenEye
2 Four Rooms
3 Get Shorty
4 Copycat
...
Name: movie_title, dtype: object
Upvotes: 28
Views: 124160
Reputation: 502
I wanted to extract the text after the symbol "@" and before the symbol "." (period). I tried this; it more or less worked, except that the result still includes the "@" symbol, which I do not want:
df['col'].astype(str).str.extract(r'(@.+.+)')
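One way to drop the "@" is to keep it outside the capture group, since str.extract returns only what is inside the parentheses. Below is a minimal sketch, assuming "before the period" means the first period after the "@"; the sample addresses and the names `emails` and `domain` are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; the original column contents are not shown in the post.
emails = pd.Series(["alice@example.com", "bob@mail.server.org", "no-at-sign"])

# "@" sits outside the group, so it is matched but not returned.
# [^.]+ takes everything up to (not including) the first period.
domain = emails.str.extract(r"@([^.]+)\.", expand=False)

print(domain)
```

Rows without an "@" followed by a period come back as NaN rather than raising an error.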
Upvotes: 1
Reputation: 1
Use a regular expression to find a year stored between parentheses. We include the parentheses in the pattern so we don't conflict with movies that have years in their titles:
movies_df['year'] = movies_df.title.str.extract(r'(\(\d\d\d\d\))', expand=False)
Removing the parentheses:
movies_df['year'] = movies_df.year.str.extract(r'(\d\d\d\d)', expand=False)
Removing the years from the 'title' column:
movies_df['title'] = movies_df.title.str.replace(r'(\(\d\d\d\d\))', '', regex=True)
Applying the strip function to get rid of any ending whitespace characters that may have appeared:
movies_df['title'] = movies_df['title'].str.strip()
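The same result can probably be had in a single pass with named capture groups, which str.extract turns into DataFrame columns. This is only a sketch: the sample titles and the names `parts` and `clean_title` are illustrative, and it assumes the year always sits in parentheses at the end of the string.

```python
import pandas as pd

# Hypothetical sample data, including a title that starts with a year
# and one with no year at all.
movies_df = pd.DataFrame(
    {"title": ["Toy Story (1995)", "2001: A Space Odyssey (1968)", "Untitled"]}
)

# Named groups become the column names of the returned DataFrame.
# The lazy .*? stops before the trailing "(YYYY)", and \s* absorbs the space.
parts = movies_df["title"].str.extract(
    r"^(?P<clean_title>.*?)\s*\((?P<year>\d{4})\)\s*$"
)

print(parts)
```

Titles that do not end in a parenthesised four-digit year get NaN in both columns, so they are easy to spot and handle separately.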
Upvotes: -1
Reputation: 7306
You should wrap the part you want to capture in a group, using (), like below:
new_df['just_movie_titles'] = df['movie_title'].str.extract(r'(.+?) \(', expand=False)
new_df['just_movie_titles']
pandas.core.strings.StringMethods.extract
StringMethods.extract(pat, flags=0, **kwargs)
Find groups in each string using passed regular expression
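As far as I can tell, the all-NaN result in the question comes from the pattern being written as a normal string rather than a raw one: Python turns "\b" into a backspace character before the regex engine ever sees it. The sketch below uses a small made-up Series (`titles`) to show the three cases side by side.

```python
import pandas as pd

# Hypothetical sample mirroring the question's data.
titles = pd.Series(["Toy Story (1995)", "GoldenEye (1995)", "Four Rooms (1995)"])

# In a normal string literal "\b" is the backspace character, not a word boundary.
assert "\b" == "\x08"
assert r"\b" == "\\b"

# The question's pattern as Python actually passes it on: literal backspaces.
broken = titles.str.extract("\x08([^\\d\\W]+)\x08", expand=False)

# Raw string: \b now reaches the regex engine, but the group stops at the first space.
first_word = titles.str.extract(r"\b([^\d\W]+)\b", expand=False)

# Capturing everything before " (" gives the full title.
just_titles = titles.str.extract(r"(.+?) \(", expand=False)

print(broken.tolist(), first_word.tolist(), just_titles.tolist())
```

So a raw string alone fixes the NaN problem but still returns only the first word; the group has to be wide enough to span the whole title.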
Upvotes: 10
Reputation: 862406
You can try str.extract and strip, but it is better to use str.split, because movie names can contain numbers too. Another solution is to replace the content of the parentheses with a regex and then strip leading and trailing whitespace:
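# convert the column to string
df['movie_title'] = df['movie_title'].astype(str)
# extract letters and spaces only; this also drops numbers that are part of a title
df['titles'] = df['movie_title'].str.extract(r'([a-zA-Z ]+)', expand=False).str.strip()
# split once on the first '(' and keep the part before it
df['titles1'] = df['movie_title'].str.split('(', n=1).str[0].str.strip()
# remove anything in parentheses (regex=True is required in recent pandas versions)
df['titles2'] = df['movie_title'].str.replace(r'\([^)]*\)', '', regex=True).str.strip()
print(df)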
          movie_title      titles      titles1      titles2
0  Toy Story 2 (1995)   Toy Story  Toy Story 2  Toy Story 2
1    GoldenEye (1995)   GoldenEye    GoldenEye    GoldenEye
2   Four Rooms (1995)  Four Rooms   Four Rooms   Four Rooms
3   Get Shorty (1995)  Get Shorty   Get Shorty   Get Shorty
4      Copycat (1995)     Copycat      Copycat      Copycat
Upvotes: 60