Reputation: 281
I have a data frame where each row in cola contains movie info in a string like this:
"The Shellshock (2014) Budget: 35,000,000 Release Date: 10/11/2014 Screen Size: 2515 Enhaced 1.1 "
I'm trying to extract the budget and the date into their own columns. The budget can range from 1,000,000 to 150,000,000, and the date is in mm-dd-yyyy order (written with slashes in the example above).
The first regex is one I made, but it's returning NaN values :'(
The second is one of a few I've tried from StackOverflow. It raises "Wrong number of items passed 3, placement implies 1". So is it matching the other digits?
df['colb'] = df['cola'].str.extract(r'^\d{1,3}(,\d{3})(,\d{3})', expand=True)
df['colc'] = df['cola'].str.extract(r'^(0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])[- /.](19|20)\d\d$', expand=True)
Any help on these patterns is greatly appreciated!
Upvotes: 0
Views: 186
Reputation: 150785
Your patterns contain several capture groups, and each capture group returns its own column. So the first command gives you two columns and the second gives you three, and you cannot assign two- or three-column data to a single new column. Also, ^ matches the start of the string and $ the end. You don't want them here, since the text you are after sits in the middle of the string; an anchored pattern cannot match there and yields NaN.
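To see both effects on the sample row, here is a minimal sketch. It assumes a one-row frame built from the string in the question; the variable names two_groups and anchored are only for illustration.

```python
import pandas as pd

# One-row frame built from the sample string in the question
df = pd.DataFrame({
    "cola": [
        "The Shellshock (2014) Budget: 35,000,000 Release Date: 10/11/2014 "
        "Screen Size: 2515 Enhaced 1.1 "
    ]
})

# Two capture groups -> a DataFrame with two columns (one per group),
# which cannot be assigned to a single new column
two_groups = df["cola"].str.extract(r"\d{1,3}(,\d{3})(,\d{3})", expand=True)
print(two_groups.shape)

# Same pattern anchored with ^: the string starts with a letter,
# so nothing matches and every cell is NaN
anchored = df["cola"].str.extract(r"^\d{1,3}(,\d{3})(,\d{3})", expand=True)
print(anchored.isna().all().all())
```

Note that each group only holds its own piece (",000"), not the whole budget, which is why a single outer group is the better fit.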
You can then do something like this:
df['colb'] = df['cola'].str.extract(r'(\d{1,3},\d{3},\d{3})', expand=True)
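The same idea should carry over to the date. The sketch below is one way to do it, not the only one: it wraps the whole date in a single capture group and turns the inner alternations into non-capturing (?:...) groups, so only one column comes back. The budget and release_date columns are hypothetical extras, and the to_datetime format assumes the dates use slashes as in the sample.

```python
import pandas as pd

df = pd.DataFrame({
    "cola": [
        "The Shellshock (2014) Budget: 35,000,000 Release Date: 10/11/2014 "
        "Screen Size: 2515 Enhaced 1.1 "
    ]
})

# One capture group and expand=False -> a Series, safe to assign to one column
df["colb"] = df["cola"].str.extract(r"(\d{1,3},\d{3},\d{3})", expand=False)

# Date: one outer capture group, inner groups made non-capturing with (?:...)
date_pattern = (
    r"((?:0[1-9]|1[0-2])[- /.](?:0[1-9]|[12]\d|3[01])[- /.](?:19|20)\d\d)"
)
df["colc"] = df["cola"].str.extract(date_pattern, expand=False)

# Optional conversions (hypothetical column names, slash-separated dates assumed)
df["budget"] = df["colb"].str.replace(",", "", regex=False).astype(int)
df["release_date"] = pd.to_datetime(df["colc"], format="%m/%d/%Y")

print(df[["colb", "colc", "budget", "release_date"]])
```

If some rows use a different separator, the format argument would need adjusting before the conversion step.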
Upvotes: 1