Jerry Gao

Reputation: 23

Python pandas str.extract year information from unclean column

I have a DataFrame with over 111K rows. I'm trying to extract the year (19** or 20**) from an uncleaned Date column and put it into a new Result column. Some rows in the Date column contain Chinese or English words.

df.Date.str.extract('20\d{2}') | df.Date.str.extract('19\d{2}')

I used str.extract() to match and extract the year, but I got ValueError: pattern contains no capture groups. How can I extract the year and fill it into a new Result column?

Rating    Date    
7.8    (June 22, 2000)   
8.0    01 April, 1997    
8.3    01 December, 1988    
7.7    01 November, 2005    
7.9    UMl Reprint University Illinois 1966 Ed    
7.7    出版日期:2008-06    
7.3    出版时间:2009.04    
7.7    台北 : 橡樹林文化, 2006.    
7.0    机械工业出版社; 第1版 (2014年11月13日)    
8.1    民国57年(1968)    
7.8    民国79 [1990]    
8.9    2010-09-13    
9.3    01 (2008)    
8.8    1998年4月第11次印刷    
7.9    2000    
7.3    2004    

Upvotes: 2

Views: 484

Answers (3)

Karn Kumar

Reputation: 8816

The following should do the job in this case.

Just an example dataset:

>>> df
                Date
0               2000
1      1998年4月第11次印刷
2  01 November, 2005
3       出版日期:2008-06
4    (June 22, 2000)

Solution:

>>> df.Date.str.extract(r'(\d{4})', expand=False)
0    2000
1    1998
2    2005
3    2008
4    2000

Or, to store the result in a new column:

>>> df['Year'] = df.Date.str.extract(r'(\d{4})', expand=False)
>>> df
                Date        Year
0               2000        2000
1      1998年4月第11次印刷   1998
2  01 November, 2005        2005
3       出版日期:2008-06    2008
4    (June 22, 2000)        2000

Another option is assign, which returns a copy of the DataFrame with the new Year column added:

>>> df  = df.assign(Year = df.Date.str.extract(r'(\d{4})', expand=False))
>>> df
                Date        Year
0               2000        2000
1      1998年4月第11次印刷   1998
2  01 November, 2005        2005
3       出版日期:2008-06    2008
4    (June 22, 2000)        2000
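
One caveat with (\d{4}): it takes the first four consecutive digits it finds, whether or not they look like a year. Below is a minimal sketch of a stricter pattern, assuming years always start with 19 or 20 and are not part of a longer digit run. The "ISBN 9787111 (2014)" row is made up for illustration and is not from the question's data.

```python
import pandas as pd

# Sample values; the first row is a hypothetical case with a longer number before the year.
s = pd.Series(["ISBN 9787111 (2014)", "1998年4月第11次印刷", "2010-09-13"])

# Loose pattern: the first run of four digits wins, even inside a longer number.
loose = s.str.extract(r"(\d{4})", expand=False)

# Stricter pattern: 19xx or 20xx, with no digit directly before or after.
strict = s.str.extract(r"(?<!\d)((?:19|20)\d{2})(?!\d)", expand=False)

print(pd.DataFrame({"text": s, "loose": loose, "strict": strict}))
```

For the first row the loose pattern picks up 9787, while the stricter one skips the long number and returns 2014. On the rows shown in the question, both patterns give the same result.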

Upvotes: 0

Giova

Reputation: 2005

The error says the regex must contain at least one capturing group, that is, a sub-pattern enclosed in a pair of parentheses.

The solution below wraps the pattern in one capturing group that contains two non-capturing groups, one per century. The extracted year is then assigned to the Result column, as you asked.

>>> df['Result'] = df.Date.str.extract(r'((?:19\d{2})|(?:20\d{2}))')

    Rating                                               Date Result
0      7.8                                    (June 22, 2000)   2000
1      8.0                                     01 April, 1997   1997
2      8.3                                  01 December, 1988   1988
3      7.7                                  01 November, 2005   2005
4      7.9            UMl Reprint University Illinois 1966 Ed   1966
5      7.7                             出版日期:2008-06   2008
6      7.3                             出版时间:2009.04   2009
7      7.7                    台北 : 橡樹林文化, 2006.   2006
8      7.0  机械工业出版社; 第1版 (2014年11月13...   2014
9      8.1                              民国57年(1968)   1968
10     7.8                                    民国79 [1990]   1990
11     8.9                                         2010-09-13   2010
12     9.3                                          01 (2008)   2008
13     8.8                          1998年4月第11次印刷   1998
14     7.9                                               2000   2000
15     7.3                                               None    NaN
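
To make the capture-group point concrete, here is a small sketch that reproduces the error and then fixes it by adding parentheses. The sample Series and the variable names are only for illustration; the last value is a made-up row with no year in it.

```python
import pandas as pd

dates = pd.Series(["(June 22, 2000)", "01 April, 1997", "出版日期:2008-06", "no year here"])

# No parentheses means no capture group, so str.extract raises ValueError.
error_message = None
try:
    dates.str.extract(r"19\d{2}|20\d{2}")
except ValueError as exc:
    error_message = str(exc)
print(error_message)

# One capture group around the whole year; (?:...) groups without capturing.
# expand=False returns a Series instead of a one-column DataFrame.
years = dates.str.extract(r"((?:19|20)\d{2})", expand=False)
print(years)
```

Rows where nothing matches come back as missing values rather than raising an error.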

Upvotes: 0

Ankur Sinha

Reputation: 6639

Sample dataframe:

                Date
0               2000
1      1998年4月第11次印刷
2  01 November, 2005
3       出版日期:2008-06
4    (June 22, 2000)

You can also do it as a one-liner:

df['Year'] = df.Date.str.extract(r'(19\d{2}|20\d{2})')

Output:

Date                Year
2000                2000
1998年4月第11次印刷   1998
01 November, 2005   2005
出版日期:2008-06     2008
(June 22, 2000)     2000
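
Note that str.extract returns text, not numbers. If integer years are needed, one possible follow-up (a sketch, not part of the one-liner above) is to convert with pd.to_numeric and the nullable Int64 dtype, so that rows without a year stay missing. The "unknown" row is added here only to show that case.

```python
import pandas as pd

df = pd.DataFrame({"Date": ["2000", "1998年4月第11次印刷", "01 November, 2005",
                            "出版日期:2008-06", "(June 22, 2000)", "unknown"]})

# Extract the year as text; expand=False gives a Series.
year_text = df["Date"].str.extract(r"(19\d{2}|20\d{2})", expand=False)

# Convert to nullable integers; rows with no match become <NA>.
df["Year"] = pd.to_numeric(year_text).astype("Int64")
print(df)
```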

Upvotes: 1
