Reputation: 969
I am trying to write a function to do some text processing on the specified columns (description, event_name) of a Pandas dataframe. I wrote this code:
#removal of unreadable chars, unwanted spaces, words of at most length two from 'description' column and lowercase the 'description' column
def data_preprocessing(source):
    return source.replace('[^A-Za-z]',' ')
    #data['description'] = data['description'].str.replace('\W+',' ')
    return source.lower()
    return source.replace("\s\s+" , " ")
    return source.replace('\s+[a-z]{1,2}(?!\S)',' ')
    return source.replace("\s\s+" , " ")
data['description'] = data['description'].apply(lambda row: data_preprocessing(row))
data['event_name'] = data['event_name'].apply(lambda row: data_preprocessing(row))
It is giving the following error:
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-94-cb5ec147833f> in <module>()
----> 1 data['description'] = data['description'].apply(lambda row: data_preprocessing(row))
2 data['event_name'] = data['event_name'].apply(lambda row: data_preprocessing(row))
3
4 #df['words']=df['words'].apply(lambda row: eliminate_space(row))
5
~/anaconda3/envs/tensorflow/lib/python3.5/site-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwds)
2549 else:
2550 values = self.asobject
-> 2551 mapped = lib.map_infer(values, f, convert=convert_dtype)
2552
2553 if len(mapped) and isinstance(mapped[0], Series):
pandas/_libs/src/inference.pyx in pandas._libs.lib.map_infer()
<ipython-input-94-cb5ec147833f> in <lambda>(row)
----> 1 data['description'] = data['description'].apply(lambda row: data_preprocessing(row))
2 data['event_name'] = data['event_name'].apply(lambda row: data_preprocessing(row))
data['description'] = data['description'].str.replace('\W+',' ')
<ipython-input-93-fdfec5f52a06> in data_preprocessing(source)
3 def data_preprocessing(source):
4
----> 5 return source.replace('[^A-Za-z]',' ')
6 #data['description'] = data['description'].str.replace('\W+',' ')
7 source = source.lower()
AttributeError: 'float' object has no attribute 'replace'
If I write the code in the following way, without a function, it works perfectly:
data['description'] = data['description'].str.replace('[^A-Za-z]',' ')
Upvotes: 0
Views: 11834
Reputation: 11105
Two things to fix:
First, when you apply a lambda function to a pandas Series, the function is called once for each element of the Series, so data_preprocessing receives a single value, not the whole column. One of those values is a float (most likely NaN from a missing entry), and a float has no replace method, hence the AttributeError. What I think you need is to apply your function to the entire Series in a vectorized manner.
Second, your function has multiple return statements. As a result, only the first statement, return source.replace('[^A-Za-z]',' '), will ever run. What you need to do is make your preprocessing changes on the variable source inside your function, and return the modified source (or an intermediate variable) at the very end.
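If you would rather keep the element-wise apply, the two points above could be sketched roughly as follows. This is only an illustration, not the approach I recommend: preprocess_one and sample are hypothetical names, and I have swapped in re.sub because the plain Python str.replace does literal substitution and does not understand regular expressions.

```python
import re

import pandas as pd


def preprocess_one(text):
    # Hypothetical element-wise helper: non-strings (e.g. NaN floats)
    # are passed through unchanged instead of raising AttributeError.
    if not isinstance(text, str):
        return text
    # Each step reassigns `text`; there is a single return at the end.
    text = re.sub(r"[^A-Za-z]", " ", text)            # keep letters only
    text = text.lower()
    text = re.sub(r"\s\s+", " ", text)                # collapse runs of whitespace
    text = re.sub(r"\s+[a-z]{1,2}(?!\S)", " ", text)  # drop words of length <= 2
    text = re.sub(r"\s\s+", " ", text)                # collapse again
    return text


# Made-up sample data, including a missing value.
sample = pd.Series(["Big   Event at 9pm!", float("nan")])
cleaned = sample.apply(preprocess_one)
print(cleaned)
```

Note that the cleaned strings can still carry a leading or trailing space; add a .strip() call if that matters.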
To rewrite your function to operate on an entire pandas Series, replace every call of the form source.<method> with source.str.<method>. The new function definition:
def data_preprocessing(source):
    # regex=True is needed on pandas 2.0+, where str.replace defaults to literal matching
    source = source.str.replace(r'[^A-Za-z]', ' ', regex=True)
    #data['description'] = data['description'].str.replace('\W+',' ')
    source = source.str.lower()
    source = source.str.replace(r"\s\s+", " ", regex=True)
    source = source.str.replace(r'\s+[a-z]{1,2}(?!\S)', ' ', regex=True)
    source = source.str.replace(r"\s\s+", " ", regex=True)
    return source
Then, instead of this:
data['description'] = data['description'].apply(lambda row: data_preprocessing(row))
data['event_name'] = data['event_name'].apply(lambda row: data_preprocessing(row))
Try this:
data['description'] = data_preprocessing(data['description'])
data['event_name'] = data_preprocessing(data['event_name'])
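To check the vectorized version end to end, here is a small self-contained sketch. The DataFrame contents are made up for illustration, and I am assuming a pandas version that accepts the regex keyword of Series.str.replace. The missing value stays missing rather than raising, because the .str methods skip NaN.

```python
import numpy as np
import pandas as pd


def data_preprocessing(source):
    # Operates on a whole Series through the .str accessor.
    source = source.str.replace(r"[^A-Za-z]", " ", regex=True)
    source = source.str.lower()
    source = source.str.replace(r"\s\s+", " ", regex=True)
    source = source.str.replace(r"\s+[a-z]{1,2}(?!\S)", " ", regex=True)
    source = source.str.replace(r"\s\s+", " ", regex=True)
    return source


# Made-up sample data with one missing description.
data = pd.DataFrame({
    "description": ["Hello,   World 42 is on!", np.nan],
    "event_name": ["PyCon 2018", "An Evening of Jazz"],
})

for column in ["description", "event_name"]:
    data[column] = data_preprocessing(data[column])

print(data)
```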
Upvotes: 5