py me

Reputation: 45

Add new column to pandas DataFrame based on another column

I'm trying to add a new column to a pandas DataFrame. The new column, df['Year_Prod'], is derived from another column, df['title'], from which I want to extract the year.

Data example:

country    designation     title
Italy      Vulkà Bianco    Nicosia 2013 Vulkà Bianco (Etna)         
Portugal   Avidagos        Quinta dos Avidagos 2011 Avidagos Red (Douro)      

Code:

import re

import pandas as pd

df=pd.read_csv(r'test.csv', index_col=0)

df['Year_Prod']=re.findall('\\d+', df['title'])

print(df.head(10))

I am getting the following error:

  File "C:\Python37\lib\site-packages\pandas\core\frame.py", line 3119, in __setitem__
    self._set_item(key, value)
  File "C:\Python37\lib\site-packages\pandas\core\frame.py", line 3194, in _set_item
    value = self._sanitize_column(key, value)
  File "C:\Python37\lib\site-packages\pandas\core\frame.py", line 3391, in _sanitize_column
    value = _sanitize_index(value, self.index, copy=False)
  File "C:\Python37\lib\site-packages\pandas\core\series.py", line 4001, in _sanitize_index
    raise ValueError('Length of values does not match length of ' 'index')

**ValueError: Length of values does not match length of index**

Please let me know your thoughts on this, thanks.

Upvotes: 4

Views: 1223

Answers (5)

piRSquared

Reputation: 294258

Character filtering instead of regex

def f(x):
  x = ''.join([c if c.isdigit() else ' ' for c in x])
  return x.strip().split(None, 1)[0]

df.assign(Year_Prod=df.title.map(f))

    country   designation                                          title Year_Prod
0     Italy  Vulkà Bianco               Nicosia 2013 Vulkà Bianco (Etna)      2013
1  Portugal      Avidagos  Quinta dos Avidagos 2011 Avidagos Red (Douro)      2011
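One caveat with the function above: a title that contains no digits leaves nothing to split, so indexing with [0] would raise an IndexError. Below is a minimal sketch of a guarded variant. The name first_number and the third sample title are hypothetical and only there for illustration.

```python
import pandas as pd


def first_number(text):
    """Return the first run of digits in text, or None if there is none."""
    # Replace every non-digit with a space so digit runs become separate tokens.
    spaced = "".join(c if c.isdigit() else " " for c in text)
    tokens = spaced.split()
    return tokens[0] if tokens else None


titles = pd.Series([
    "Nicosia 2013 Vulkà Bianco (Etna)",
    "Quinta dos Avidagos 2011 Avidagos Red (Douro)",
    "Non-vintage blend",  # hypothetical row without any digits
])

years = titles.map(first_number)
print(years)
```

Rows without a number end up as missing values instead of stopping the whole map call.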

Upvotes: 1

Karn Kumar

Reputation: 8816

Another approach, selecting the column by position with iloc:

>>> df['Year_Prod'] = df.iloc[:,2].str.extract(r'(\d{4})', expand=False)
>>> df
    country   designation                                          title Year_Prod
0     Italy  Vulkà Bianco               Nicosia 2013 Vulkà Bianco (Etna)      2013
1  Portugal      Avidagos  Quinta dos Avidagos 2011 Avidagos Red (Douro)      2011
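Note that the extracted years are still strings. If numbers are needed, something like the following sketch may help. It rebuilds the two sample rows inline (an assumption, instead of reading test.csv) and converts the result with pd.to_numeric.

```python
import pandas as pd

# Sample rows copied from the question, built inline instead of reading test.csv.
df = pd.DataFrame({
    "country": ["Italy", "Portugal"],
    "designation": ["Vulkà Bianco", "Avidagos"],
    "title": [
        "Nicosia 2013 Vulkà Bianco (Etna)",
        "Quinta dos Avidagos 2011 Avidagos Red (Douro)",
    ],
})

# Column position 2 is "title" in this frame, as in the iloc call above.
year_text = df.iloc[:, 2].str.extract(r"(\d{4})", expand=False)

# str.extract returns strings; pd.to_numeric converts them to numbers.
df["Year_Prod"] = pd.to_numeric(year_text)
print(df["Year_Prod"].tolist())
```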

Upvotes: 1

BENY

Reputation: 323226

pandas has a str.findall method as well:

df.title.str.findall(r'\d+').str[0]
Out[239]: 
0    2013
1    2011
Name: title, dtype: object

# To store the result as a new column (suggested by pygo):
# df['Year_Prod'] = df.title.str.findall(r'\d+').str[0]
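A small sketch of how this behaves on less tidy titles. The second and third titles are made up for illustration. Since \d+ matches any run of digits, the first match is not always the year, while \d{4} is a bit stricter. Rows with no match come back as NaN.

```python
import pandas as pd

titles = pd.Series([
    "Nicosia 2013 Vulkà Bianco (Etna)",  # from the question
    "Lot 7 Reserve 2015",                # hypothetical: a number before the year
    "Non-vintage blend",                 # hypothetical: no digits at all
])

# First run of digits of any length.
first_any = titles.str.findall(r"\d+").str[0]

# First run of exactly four consecutive digits.
first_four = titles.str.findall(r"\d{4}").str[0]

print(first_any)
print(first_four)
```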

Upvotes: 5

Vaishali

Reputation: 38415

You can use pandas str.extract

df['Year_Prod']= df.title.str.extract(r'(\d{4})', expand=False)

    country     designation     title                                          Year_Prod
0   Italy       Vulkà Bianco    Nicosia 2013 Vulkà Bianco (Etna)                2013
1   Portugal    Avidagos        Quinta dos Avidagos 2011 Avidagos Red (Douro)   2011

Edit: As @Paul H. suggested in the comments, your code fails because re.findall expects a single string, but you are passing a whole Series. You could call it once per row with apply, where each value passed is a string, but there is little point since str.extract is more efficient.

df.title.apply(lambda x: re.findall(r'\d{4}', x)[0])
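To put the two options side by side, here is a minimal, self-contained sketch. It builds the two sample rows from the question inline (an assumption, in place of reading test.csv) and checks that both approaches give the same years.

```python
import re

import pandas as pd

# Sample rows copied from the question, built inline instead of reading test.csv.
df = pd.DataFrame({
    "country": ["Italy", "Portugal"],
    "designation": ["Vulkà Bianco", "Avidagos"],
    "title": [
        "Nicosia 2013 Vulkà Bianco (Etna)",
        "Quinta dos Avidagos 2011 Avidagos Red (Douro)",
    ],
})

# re.findall works on one string at a time, so apply calls it once per row.
via_apply = df["title"].apply(lambda s: re.findall(r"\d{4}", s)[0])

# str.extract runs the pattern over the whole column; expand=False returns a Series.
via_extract = df["title"].str.extract(r"(\d{4})", expand=False)

print(via_apply.tolist())
print(via_extract.tolist())
```

Both produce one value per row, which is what a column assignment needs.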

Upvotes: 5

Charles Landau

Reputation: 4265

You are not specifying a delimiter; the default for pd.read_csv is a comma (,).

You can use pd.Series.apply:

import re    
import pandas as pd

def year_finder(x):
    return re.findall('\\d+', x)[0] # First match I find

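# A delimiter longer than one character is read as a regex, so the pipes are escaped
df=pd.read_csv(r'test.csv', delimiter=r'\|\|', engine='python', index_col=0)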
df['Year_Prod']= df["title"].apply(year_finder)

print(df.head(10))

Edit: for the str.extract method, see @Vaishali's answer.
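For anyone who wants to try the delimiter handling without a file on disk, here is a rough sketch. The file contents are hypothetical and assume fields separated by "||". Because pandas treats separators longer than one character as regular expressions, the pipes are escaped and the Python parsing engine is requested explicitly.

```python
import io
import re

import pandas as pd

# Hypothetical file contents using "||" as the field separator.
raw = (
    "country||designation||title\n"
    "Italy||Vulkà Bianco||Nicosia 2013 Vulkà Bianco (Etna)\n"
    "Portugal||Avidagos||Quinta dos Avidagos 2011 Avidagos Red (Douro)\n"
)

# Escape the pipes so the separator is read as a literal "||".
df = pd.read_csv(io.StringIO(raw), sep=r"\|\|", engine="python")

# Take the first run of digits in each title.
df["Year_Prod"] = df["title"].apply(lambda s: re.findall(r"\d+", s)[0])

print(df.head(10))
```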

Upvotes: 2
