Reputation: 3
I am applying a string manipulation function to a Pandas data frame column with more than a million rows. Because of some bad data in between, it fails with:
AttributeError: 'float' object has no attribute 'lower'
Is there a way I can save the progress made so far on the column?
Let's say the manipulation function is:
def clean_strings(strg):
    strg = strg.lower()  # lowercase
    return strg
It is applied to the data frame as:
df_sample['clean_content'] = df_sample['content'].apply(clean_strings)
Where 'content' is the column with strings and 'clean_content' is the new column added.
Please suggest other approaches. TIA
Upvotes: 0
Views: 105
Reputation: 402813
Is there a way I can save the progress made so far on the column?
Unfortunately not. These function calls act atomically on the dataframe: either the entire operation succeeds or it fails, and a failure returns no partial result. I'm assuming str.lower is just a representative example and that you're actually doing much more in your function. That makes this a job for exception handling.
def clean_string(row):
    try:
        return row.lower()
    except AttributeError:
        return row
If a particular record fails, you can handle the raised exception inside the function itself, controlling what is returned in that case.
You'd call the function appropriately -
df_sample['clean_content'] = df_sample['content'].apply(clean_string)
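As a minimal sketch of how this behaves, assume a small made-up frame in which one row is NaN (a float), a common cause of this error. The sample data and the bad_rows name are illustrative assumptions, not part of the original question:

```python
import numpy as np
import pandas as pd

def clean_string(value):
    # Lowercase strings; return anything without a .lower() method unchanged.
    try:
        return value.lower()
    except AttributeError:
        return value

# Hypothetical sample data: the NaN stands in for the "bad data".
df_sample = pd.DataFrame({"content": ["Hello World", np.nan, "MiXeD Case"]})
df_sample["clean_content"] = df_sample["content"].apply(clean_string)

# Rows whose content is not a string are easy to locate afterwards.
is_string = df_sample["content"].map(lambda v: isinstance(v, str))
bad_rows = df_sample[~is_string]
```

The whole column is processed in one pass, and the non-string rows can be inspected or repaired separately instead of aborting the run.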
Note that content is a column of objects, and objects generally offer very poor performance in terms of vectorised operations. I'd recommend performing a cast to string -
df_sample['content'] = df_sample['content'].astype(str)
After this, consider using pandas' vectorised .str accessor functions in place of clean_string.
For reference, if all you want to do is lowercase your string column, use str.lower -
df_sample['content'] = df_sample['content'].astype(str).str.lower()
Note that, for an object column, you can still use the .str accessor. However, non-string elements will be coerced to NaN -
df_sample['content'] = df_sample['content'].str.lower() # assuming `content` is of `object` type
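To make the difference between the two approaches concrete, here is a small sketch on a made-up object column (the values are assumptions for illustration). One caveat worth checking on your own data: depending on the pandas version, casting with astype(str) may turn a missing value into the literal text "nan" rather than keeping it missing.

```python
import numpy as np
import pandas as pd

# Hypothetical object column with a missing value and a non-string element.
content = pd.Series(["ABC", np.nan, 42], dtype=object)

# .str.lower() lowercases the strings and yields NaN for everything else.
lowered = content.str.lower()

# Casting first stringifies the elements, so 42 becomes "42" and is kept.
# How NaN is represented after the cast can vary between pandas versions,
# so inspect the result before relying on it.
cast_then_lowered = content.astype(str).str.lower()
```

Which one to prefer depends on whether non-string values should be kept as text or flagged as missing.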
Upvotes: 0
Reputation: 2039
First, use map, as your input is only one column and map can be faster than apply, although it is worth timing both on your data:
df_sample['clean_content']= df_sample['content'].map(clean_strings)
Second, cast the column to string type before running your function, so that every value has a .lower method:
df_sample['content'] = df_sample['content'].astype(str)
def clean_strings(strg):
    strg = strg.lower()  # lowercase
    return strg
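If the bad values are all missing values (NaN), one alternative to casting is map's na_action parameter, which leaves NaN untouched instead of passing it to the function. This is only a sketch under that assumption: any other non-string value, such as an ordinary float, would still raise AttributeError. The sample data is made up.

```python
import numpy as np
import pandas as pd

def clean_strings(strg):
    return strg.lower()

# Hypothetical sample data with one missing value.
df_sample = pd.DataFrame({"content": ["Some TEXT", np.nan, "More Text"]})

# na_action="ignore" propagates NaN without calling clean_strings on it.
df_sample["clean_content"] = df_sample["content"].map(
    clean_strings, na_action="ignore"
)
```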
Upvotes: 1