Reputation: 45
I'm trying to add a new column to a pandas DataFrame. The new column, df['Year_Prod'], is derived from another column, df['title'], from which I am extracting the year.
Data example:
country   designation   title
Italy     Vulkà Bianco  Nicosia 2013 Vulkà Bianco (Etna)
Portugal  Avidagos      Quinta dos Avidagos 2011 Avidagos Red (Douro)
Code:
import re
import pandas as pd
df=pd.read_csv(r'test.csv', index_col=0)
df['Year_Prod']=re.findall('\\d+', df['title'])
print(df.head(10))
I am getting the following error:
File "C:\Python37\lib\site-packages\pandas\core\frame.py", line 3119, in __setitem__
  self._set_item(key, value)
File "C:\Python37\lib\site-packages\pandas\core\frame.py", line 3194, in _set_item
  value = self._sanitize_column(key, value)
File "C:\Python37\lib\site-packages\pandas\core\frame.py", line 3391, in _sanitize_column
  value = _sanitize_index(value, self.index, copy=False)
File "C:\Python37\lib\site-packages\pandas\core\series.py", line 4001, in _sanitize_index
  raise ValueError('Length of values does not match length of ' 'index')
**ValueError: Length of values does not match length of index**
Please let me know your thoughts on this, thanks.
Upvotes: 4
Views: 1223
Reputation: 294258
Plain string methods instead of regex: blank out every non-digit character, then take the first token.
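def f(x):
    # Replace every non-digit with a space so each run of digits becomes its own token.
    x = ''.join([c if c.isdigit() else ' ' for c in x])
    # Return the first whitespace-separated token, i.e. the first run of digits.
    return x.strip().split(None, 1)[0]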
df.assign(Year_Prod=df.title.map(f))
   country   designation   title                                          Year_Prod
0  Italy     Vulkà Bianco  Nicosia 2013 Vulkà Bianco (Etna)               2013
1  Portugal  Avidagos      Quinta dos Avidagos 2011 Avidagos Red (Douro)  2011
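One caveat with the helper above: a title that contains no digits leaves nothing to split, so indexing with [0] raises an IndexError. A minimal sketch of a guarded variant follows; the name first_number and the third sample title are made up for illustration, not part of the original data.

```python
import pandas as pd

def first_number(text):
    """Return the first run of digits in text, or None if there is none."""
    # Replace every non-digit with a space, then split on whitespace.
    tokens = ''.join(c if c.isdigit() else ' ' for c in text).split()
    return tokens[0] if tokens else None

# Two titles from the question plus one hypothetical title without a year.
df = pd.DataFrame({
    'title': [
        'Nicosia 2013 Vulkà Bianco (Etna)',
        'Quinta dos Avidagos 2011 Avidagos Red (Douro)',
        'No vintage here',
    ]
})

result = df.assign(Year_Prod=df['title'].map(first_number))
print(result)
```

Rows without any digits end up with a missing value instead of stopping the whole map call.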
Upvotes: 1
Reputation: 8816
Just another way, based on the iloc method:
>>> df['Year_Prod'] = df.iloc[:,2].str.extract(r'(\d{4})', expand=False)
>>> df
   country   designation   title                                          Year_Prod
0  Italy     Vulkà Bianco  Nicosia 2013 Vulkà Bianco (Etna)               2013
1  Portugal  Avidagos      Quinta dos Avidagos 2011 Avidagos Red (Douro)  2011
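Note that str.extract returns the year as text. If a numeric column is wanted, one option is to pass the result through pd.to_numeric. The sketch below builds a small frame from the question's two rows and selects the column by name rather than by position; both choices are assumptions made for illustration.

```python
import pandas as pd

# Small stand-in for the question's data.
df = pd.DataFrame({
    'country': ['Italy', 'Portugal'],
    'title': [
        'Nicosia 2013 Vulkà Bianco (Etna)',
        'Quinta dos Avidagos 2011 Avidagos Red (Douro)',
    ],
})

# expand=False makes str.extract return a Series for a single capture group.
years = df['title'].str.extract(r'(\d{4})', expand=False)

# Convert the extracted strings to numbers.
df['Year_Prod'] = pd.to_numeric(years)
print(df)
```

Titles without a four-digit match would come back as missing values, which to_numeric keeps as NaN.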
Upvotes: 1
Reputation: 323226
pandas has str.findall as well:
df.title.str.findall(r'\d+').str[0]
Out[239]:
0 2013
1 2011
Name: title, dtype: object
# Assignment form, suggested by pygo:
# df['Year_Prod'] = df.title.str.findall(r'\d+').str[0]
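It may help to see what .str[0] does when a title holds several numbers or none at all. In the sketch below, the second and third titles are invented for illustration: the pattern \d+ picks up the first run of digits whatever its length, and an empty match list becomes a missing value.

```python
import pandas as pd

titles = pd.Series([
    'Nicosia 2013 Vulkà Bianco (Etna)',  # from the question
    'Blend No. 7 Red 2011 (Douro)',      # hypothetical: a number before the year
    'Non-vintage Brut',                  # hypothetical: no digits at all
])

matches = titles.str.findall(r'\d+')  # one list of digit runs per row
first = matches.str[0]                # first element of each list, NaN if the list is empty
print(matches.tolist())
print(first.tolist())
```

If only four-digit years are wanted, a stricter pattern such as \d{4} avoids picking up the stray 7.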
Upvotes: 5
Reputation: 38415
You can use pandas str.extract:
df['Year_Prod'] = df.title.str.extract(r'(\d{4})')
   country   designation   title                                          Year_Prod
0  Italy     Vulkà Bianco  Nicosia 2013 Vulkà Bianco (Etna)               2013
1  Portugal  Avidagos      Quinta dos Avidagos 2011 Avidagos Red (Douro)  2011
Edit: As @Paul H. suggested in the comments, your code doesn't work because re.findall expects a string, but you are passing a Series. It can be done with apply, where the value passed at each row is a string, but that doesn't make much sense since str.extract is more efficient:
df.title.apply(lambda x: re.findall(r'\d{4}', x)[0])
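To make the point concrete, here is a small side-by-side sketch of the two routes on the question's two titles. The variable names via_apply and via_extract are illustrative only.

```python
import re
import pandas as pd

df = pd.DataFrame({
    'title': [
        'Nicosia 2013 Vulkà Bianco (Etna)',
        'Quinta dos Avidagos 2011 Avidagos Red (Douro)',
    ]
})

# Row by row: apply hands re.findall one plain string at a time.
via_apply = df['title'].apply(lambda s: re.findall(r'\d{4}', s)[0])

# Vectorized: the .str accessor applies the pattern to the whole column.
via_extract = df['title'].str.extract(r'(\d{4})', expand=False)

print(via_apply.tolist())
print(via_extract.tolist())
```

Both give the same strings here. The apply version would raise an IndexError on a title with no four-digit number, whereas str.extract would return a missing value for that row.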
Upvotes: 5
Reputation: 4265
You are not specifying a delimiter; the default for .read_csv is a comma (,).
You can use pd.Series.apply:
import re
import pandas as pd
def year_finder(x):
    return re.findall('\\d+', x)[0] # First match I find
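# Assuming the file is delimited by '||': separators longer than one character are
# treated as regular expressions, so the pipes are escaped and the Python engine is used.
df=pd.read_csv(r'test.csv', delimiter=r'\|\|', engine='python', index_col=0)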
df['Year_Prod']= df["title"].apply(year_finder)
print(df.head(10))
Edit: for the str.extract method, see @Vaishali's answer.
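For anyone who wants to try the delimiter handling without the original file, the sketch below reads an in-memory stand-in. The '||' layout and the id column are assumptions, since the question does not show the raw file.

```python
import io
import pandas as pd

# Hypothetical file contents, assuming '||' as the field separator.
raw = (
    "id||country||designation||title\n"
    "0||Italy||Vulkà Bianco||Nicosia 2013 Vulkà Bianco (Etna)\n"
    "1||Portugal||Avidagos||Quinta dos Avidagos 2011 Avidagos Red (Douro)\n"
)

# A multi-character separator is interpreted as a regex, so each pipe is escaped.
df = pd.read_csv(io.StringIO(raw), sep=r'\|\|', engine='python', index_col=0)
print(df)
```

If the real file turns out to be comma-separated, the default read_csv call from the question is already fine and no delimiter argument is needed.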
Upvotes: 2