Reputation: 23
I have a DataFrame with over 111K rows. I'm trying to extract year information (19**, 20**) from the uncleaned Date column and fill it into a new Result column. Some rows in the Date column contain Chinese or English words.
df.Date.str.extract('20\d{2}') | df.Date.str.extract('19\d{2}')
I used str.extract() to match and extract the year, but I got the error ValueError: pattern contains no capture groups. How can I get the year information and fill it into a new Result column?
Rating Date
7.8 (June 22, 2000)
8.0 01 April, 1997
8.3 01 December, 1988
7.7 01 November, 2005
7.9 UMl Reprint University Illinois 1966 Ed
7.7 出版日期:2008-06
7.3 出版时间:2009.04
7.7 台北 : 橡樹林文化, 2006.
7.0 机械工业出版社; 第1版 (2014年11月13日)
8.1 民国57年(1968)
7.8 民国79 [1990]
8.9 2010-09-13
9.3 01 (2008)
8.8 1998年4月第11次印刷
7.9 2000
7.3 2004
Upvotes: 2
Views: 484
Reputation: 8816
The following should do the job for the given case.
>>> df
Date
0 2000
1 1998年4月第11次印刷
2 01 November, 2005
3 出版日期:2008-06
4 (June 22, 2000)
>>> df.Date.str.extract(r'(\d{4})', expand=False)
0 2000
1 1998
2 2005
3 2008
4 2000
Or
>>> df['Year'] = df.Date.str.extract(r'(\d{4})', expand=False)
>>> df
Date Year
0 2000 2000
1 1998年4月第11次印刷 1998
2 01 November, 2005 2005
3 出版日期:2008-06 2008
4 (June 22, 2000) 2000
Another option is assign, which puts the extracted values into the new column Year.
>>> df = df.assign(Year = df.Date.str.extract(r'(\d{4})', expand=False))
>>> df
Date Year
0 2000 2000
1 1998年4月第11次印刷 1998
2 01 November, 2005 2005
3 出版日期:2008-06 2008
4 (June 22, 2000) 2000
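One caveat with the pattern above: \d{4} matches the first run of any four digits, not only years. The sketch below compares it with a stricter pattern that accepts only 19xx/20xx and refuses matches sitting inside a longer digit string. The sample strings and the names loose and strict are made up for illustration and are not from the question's data.

```python
import pandas as pd

# Hypothetical sample: the first row has a long number before the real year.
dates = pd.Series(["ISBN 9787111 (2014年11月13日)", "01 April, 1997", "no year here"])

# Any four digits: grabs the first four digits of the longer number.
loose = dates.str.extract(r"(\d{4})", expand=False)

# Only 19xx/20xx, with lookarounds so the match is not part of a longer digit run.
strict = dates.str.extract(r"(?<!\d)((?:19|20)\d{2})(?!\d)", expand=False)

print(pd.DataFrame({"Date": dates, "loose": loose, "strict": strict}))
```

Whether the stricter form is needed depends on what else appears in the 111K rows; for the rows shown in the question both patterns give the same years.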
Upvotes: 0
Reputation: 2005
The error says the regex must have at least one capturing group, that is, a sub-pattern enclosed in a pair of parentheses.
In the solution I propose, I added one capturing group wrapping two non-capturing ones. As you asked, the extracted data is then inserted into the Result column.
>>> df['Result'] = df.Date.str.extract(r'((?:19\d{2})|(?:20\d{2}))')
Rating Date Result
0 7.8 (June 22, 2000) 2000
1 8.0 01 April, 1997 1997
2 8.3 01 December, 1988 1988
3 7.7 01 November, 2005 2005
4 7.9 UMl Reprint University Illinois 1966 Ed 1966
5 7.7 出版日期:2008-06 2008
6 7.3 出版时间:2009.04 2009
7 7.7 台北 : 橡樹林文化, 2006. 2006
8 7.0 机械工业出版社; 第1版 (2014年11月13... 2014
9 8.1 民国57年(1968) 1968
10 7.8 民国79 [1990] 1990
11 8.9 2010-09-13 2010
12 9.3 01 (2008) 2008
13 8.8 1998年4月第11次印刷 1998
14 7.9 2000 2000
15 7.3 None NaN
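To see the difference the parentheses make, here is a minimal sketch that reproduces the error from the question and then the working call. The three sample strings and the variable names are assumptions for illustration.

```python
import pandas as pd

dates = pd.Series(["(June 22, 2000)", "01 December, 1988", "no year"])

# Without a capture group, extract() has nothing to return and raises.
try:
    dates.str.extract(r"19\d{2}|20\d{2}")
    error_message = None
except ValueError as exc:
    error_message = str(exc)

# Wrapping the whole alternation in one group makes it work.
# expand=False returns a Series instead of a one-column DataFrame.
years = dates.str.extract(r"(19\d{2}|20\d{2})", expand=False)

print(error_message)
print(years)
```

Rows with no match come back as missing values rather than raising, so the result stays aligned with the original index.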
Upvotes: 0
Reputation: 6639
Sample dataframe:
Date
0 2000
1 1998年4月第11次印刷
2 01 November, 2005
3 出版日期:2008-06
4 (June 22, 2000)
You can also do it as a one-liner:
df['Year'] = df.Date.str.extract(r'(19\d{2}|20\d{2})')
Output:
Date Year
2000 2000
1998年4月第11次印刷 1998
01 November, 2005 2005
出版日期:2008-06 2008
(June 22, 2000) 2000
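Note that the extracted years are strings. If you want numbers, something like the sketch below may help; it uses a few of the rows from the question plus a missing value, and the intermediate name year_text is just for illustration. The nullable Int64 dtype is one choice here, assuming a reasonably recent pandas.

```python
import pandas as pd

df = pd.DataFrame(
    {"Date": ["(June 22, 2000)", "民国57年(1968)", "2010-09-13", "1998年4月第11次印刷", None]}
)

# Extract the year as text; rows without a match (or missing input) stay missing.
year_text = df["Date"].str.extract(r"(19\d{2}|20\d{2})", expand=False)

# Convert to a nullable integer column so missing years do not force floats.
df["Result"] = pd.to_numeric(year_text).astype("Int64")

print(df)
```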
Upvotes: 1