Extracting Dates and Large Numbers Using Regex

Question

I have a data frame where each row of cola contains movie info in a string like this:

"The Shellshock (2014) Budget: 35,000,000 Release Date: 10/11/2014 Screen Size: 2515 Enhaced 1.1 "

I'm trying to extract the budget and the date into their own columns. The budget can range from 1,000,000 to 150,000,000, and the date is in mm-dd-yyyy order.

The first regex is one I made but it's returning NaN values :'(

The second is one of a few I've tried from StackOverflow. It fails with "Wrong number of items passed 3, placement implies 1". So is it matching the other digits?

df['colb'] = df['cola'].str.extract(r'^\d{1,3}(,\d{3})(,\d{3})', expand=True)

df['colc'] = df['cola'].str.extract(r'^(0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])[- /.](19|20)\d\d$', expand=True)

Any help on these patterns is greatly appreciated!

Quang Hoang · Accepted Answer

You have several capture groups in each pattern, and str.extract returns one column per capture group. So the first command gives you two columns and the second gives you three. You cannot assign two- or three-column data to a single new column. Also, ^ anchors the match to the start of the string and $ to the end. You don't want them, since the values you are after sit in the middle of the string.
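To see both effects concretely, here is a minimal sketch run against the sample string from the question. The variable names (s, anchored, unanchored, single) are made up for illustration and are not part of the original post.

```python
import pandas as pd

# One-row Series holding the sample string from the question.
s = pd.Series([
    "The Shellshock (2014) Budget: 35,000,000 Release Date: 10/11/2014 Screen Size: 2515 Enhaced 1.1 "
])

# Original pattern: anchored with ^ and containing two capture groups.
# The string starts with "The", so nothing matches and both columns are NaN.
anchored = s.str.extract(r'^\d{1,3}(,\d{3})(,\d{3})', expand=True)

# Dropping ^ lets the pattern match, but there are still two groups,
# so the result has two columns, each holding only a ",000" fragment.
unanchored = s.str.extract(r'\d{1,3}(,\d{3})(,\d{3})', expand=True)

# A single group around the whole number gives a single column.
single = s.str.extract(r'(\d{1,3},\d{3},\d{3})', expand=True)

print(anchored.shape, unanchored.shape, single.shape)
```

Only the last result has one column, which is why it can be assigned to a single new column.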

You can then do something like this:

df['colb'] = df['cola'].str.extract(r'(\d{1,3},\d{3},\d{3})', expand=True)
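The answer only spells out the budget column. The same idea should carry over to the date: wrap the whole date in one capture group, turn the inner groups into non-capturing ones with (?:...), and drop ^ and $. The sketch below is one possible way to do it, not the answerer's own code. The numeric and datetime conversion at the end, and the column names budget and release_date, are assumptions added for illustration, and the date format assumes the slash separator seen in the sample row.

```python
import pandas as pd

# One-row frame built from the sample string in the question.
df = pd.DataFrame({
    "cola": [
        "The Shellshock (2014) Budget: 35,000,000 Release Date: 10/11/2014 Screen Size: 2515 Enhaced 1.1 "
    ]
})

# Budget: one capture group, so one result; expand=False returns a Series.
df["colb"] = df["cola"].str.extract(r"(\d{1,3},\d{3},\d{3})", expand=False)

# Date: a single outer capture group; the inner groups are non-capturing (?:...).
date_pattern = r"((?:0[1-9]|1[012])[- /.](?:0[1-9]|[12][0-9]|3[01])[- /.](?:19|20)\d\d)"
df["colc"] = df["cola"].str.extract(date_pattern, expand=False)

# Optional conversion (hypothetical column names, not from the original post).
df["budget"] = pd.to_numeric(df["colb"].str.replace(",", "", regex=False))
# Assumes every date uses "/" as the separator, as in the sample row.
df["release_date"] = pd.to_datetime(df["colc"], format="%m/%d/%Y")

print(df[["colb", "colc", "budget", "release_date"]])
```

If other rows use a different separator such as - or ., the format string passed to pd.to_datetime would need to change accordingly.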
