Retain successfully transformed rows in the event of runtime error in pandas

Question

I am applying a string manipulation function to a pandas data frame column with more than a million rows. Because of some bad data partway through the column, it fails with:

AttributeError: 'float' object has no attribute 'lower'

Is there a way I can save the progress made so far on the column?

Let's say the manipulation function is:

def clean_strings(strg):
    strg = strg.lower()  # convert to lowercase
    return strg

It is applied to the data frame as:

df_sample['clean_content'] = df_sample['content'].apply(clean_strings)

Here 'content' is the column with strings and 'clean_content' is the new column being added.

Please suggest other approaches. TIA

cs95 · Accepted Answer

Is there a way I can save the progress made so far on the column?

Unfortunately not. These function calls are meant to act atomically on the dataframe: either the entire operation succeeds or it fails, and if any row raises, nothing is assigned to the new column. I'm assuming str.lower is just a representative example and that you're actually doing much more in your function. That makes this a job for exception handling.

def clean_string(row):
    try:
        return row.lower()
    except AttributeError:
        return row

If a particular record fails, you can handle the raised exception inside the function itself, controlling what is returned in that case.

You'd apply the function the same way as before -

df_sample['clean_content'] = df_sample['content'].apply(clean_string)
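To make the exception-handling approach concrete, here is a minimal, self-contained sketch. The sample data and the `failed` mask are assumptions added for illustration and are not part of the original question. The mask is one possible way to find the rows that were passed through unchanged, so they can be inspected afterwards.

```python
import numpy as np
import pandas as pd


def clean_string(value):
    """Lowercase a string; return non-string values unchanged."""
    try:
        return value.lower()
    except AttributeError:
        # Raised for floats (including NaN), ints, None, and so on
        return value


# Hypothetical sample data containing two "bad" rows
df_sample = pd.DataFrame({"content": ["Hello", np.nan, "WORLD", 42]})

# The apply call now completes even though some rows are not strings
df_sample["clean_content"] = df_sample["content"].apply(clean_string)

# Boolean mask marking the rows that were not strings to begin with
failed = ~df_sample["content"].map(lambda v: isinstance(v, str))
bad_rows = df_sample[failed]
```

With this in place the good rows are transformed, and `bad_rows` holds the records that need a closer look.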

Note that content is a column of objects, and object columns generally perform very poorly with vectorised operations. I'd recommend casting to string, which also turns non-string values into their string form (a float 3.5 becomes '3.5'), so they no longer raise -

df_sample['content'] = df_sample['content'].astype(str)

After this, consider using pandas' vectorised .str accessor functions in place of clean_string.

For reference, if all you want to do is lowercase your string column, use str.lower -

df_sample['content'] = df_sample['content'].astype(str).str.lower()

Note that, for an object column, you can still use the .str accessor. However, non-string elements will be coerced to NaN -

df_sample['content'] = df_sample['content'].str.lower()  # assuming `content` is of `object` type
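The difference between the two vectorised variants can be seen on a small series. This is a sketch with made-up values, not the asker's data. How `astype(str)` represents missing values may depend on the pandas version (in many versions NaN becomes the literal string 'nan'), so check that on your own installation.

```python
import numpy as np
import pandas as pd

# Hypothetical object column with two non-string entries
s = pd.Series(["Hello", np.nan, 3.5, "WORLD"], dtype=object)

# .str on an object column: non-string elements come back as NaN
lowered = s.str.lower()

# Cast first: every element is stringified before lowercasing,
# so the float 3.5 survives as the text "3.5"
cast_then_lowered = s.astype(str).str.lower()

# Rows that were not strings can be located through the NaN results
non_string_mask = lowered.isna()
```

Which variant is preferable depends on whether the non-string rows should be kept as text or flagged as missing for later cleanup.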
