soumeng78

Reputation: 880

Issue with pd.DataFrame.apply with arguments

I want to create augmented data in a new dataframe for every row of an original dataframe.

So I've defined an augment function that I want to use with apply, as follows:

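import numpy as np
import pandas as pd
import tensorflow as tf

# convert_image_to_binary_image and data_augmentation0 are my own helpers, defined elsewhere
def augment(row: pd.Series, column_name: str, target_df: pd.DataFrame, num_samples: int):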
    # print(type(row))
    target_df_start_index = target_df.shape[0]
    raw_img = row[column_name].astype('uint8')
    bin_image = convert_image_to_binary_image(raw_img)
    bin_3dimg = tf.expand_dims(input=bin_image, axis=2)
    bin_img_reshaped = tf.image.resize_with_pad(image=bin_3dimg, target_width=128, target_height=128, method="bilinear")

    for i in range(num_samples + 1):
        new_row = row.copy(deep=True)

        if i == 0:
            new_row[column_name] = np.squeeze(bin_img_reshaped, axis=2)
        else:
            aug_image = data_augmentation0(bin_img_reshaped)
            new_row[column_name] = np.squeeze(aug_image, axis=2)

        # display.display(new_row)
        target_df.loc[target_df_start_index + i] = new_row

    # print(target_df.shape)
    # display.display(target_df)

When I call it as follows, everything works:

tmp_df = pd.DataFrame(None, columns=testDF.columns)
augment(testDF.iloc[0], column_name='binMap', target_df=tmp_df, num_samples=4)
augment(testDF.iloc[1], column_name='binMap', target_df=tmp_df, num_samples=4)

However, when I use the apply method instead, the prints and display calls work fine, but the resulting dataframe shows errors:

tmp_df = pd.DataFrame(None, columns=testDF.columns)
testDF.apply(augment, args=('binMap', tmp_df, 4, ), axis=1)

This is what the output looks like after the apply call:

,data
<Error>, <Error>
<Error>, <Error>

What am I doing wrong?

Upvotes: 1

Views: 38

Answers (2)

J_H

Reputation: 20425

Your test is very nice, thank you for the clear exposition. I am happy to be your rubber duck.

In test A, you (successfully) mess with testDF.iloc[0] and [1], using kind of a Fortran-style API for augment(), leaving a side effect result in tmp_df.

Test B is carefully constructed to be "the same" except for the .apply() call. So let's see, what's different? Hard to say. Let's go examine the docs.

Oh, right! We're using the .apply() API, so we'd better follow it. Down at the end it explains:

Returns: Series or DataFrame

Result of applying func along the given axis of the DataFrame.

But you're offering None instead: augment() has no return statement, so every call implicitly returns None.

Now, I'm not here to pass judgement on whether it's best to have side effects on a target df -- that's up to you. But .apply() will be bent out of shape until you give it something nice to store as its own result. Happy hunting!
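To make the point concrete, here is a minimal sketch with a toy DataFrame. The names side_effect_only, side_effect_and_return and collected are made up for illustration and are not from the question. It compares a row function that only has side effects with one that also returns the row:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
collected = []  # hypothetical stand-in for the target dataframe


def side_effect_only(row):
    # Records a value elsewhere but has no return statement,
    # so every call implicitly returns None.
    collected.append(row["a"] * 10)


def side_effect_and_return(row):
    # Same side effect, but hands the row back to apply().
    collected.append(row["a"] * 10)
    return row


none_result = df.apply(side_effect_only, axis=1)
row_result = df.apply(side_effect_and_return, axis=1)

print(none_result)  # one missing value per input row
print(row_result)   # a DataFrame rebuilt from the returned rows
```

In both cases the side effect happens. The only difference is what .apply() has available to assemble into its own result.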


Tiny little style nit.

You wrote

args=('binMap', tmp_df, 4, )

to offer a 3-tuple. Better to write

args=('binMap', tmp_df, 4)

As written it tends to suggest 1-tuple notation.

When is trailing comma helpful?

  1. in a 1-tuple it is essential: x = (7,)
  2. in multiline dict / list expressions it minimizes git diffs, when inevitably another entry ('cherry'?) will later be added
fruits = [
    'apple',
    'banana',
]
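
A quick way to see why the comma matters in the 1-tuple case (a small sketch; the variable names are just for illustration):

```python
one_tuple = (7,)        # trailing comma makes this a tuple
just_an_int = (7)       # parentheses alone are only grouping
three_tuple = ('binMap', None, 4)
three_tuple_trailing = ('binMap', None, 4, )  # same value, noisier spelling

print(type(one_tuple).__name__, type(just_an_int).__name__)
print(three_tuple == three_tuple_trailing)
```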

Upvotes: 1

soumeng78

Reputation: 880

This change worked for me:

def augment(row: pd.Series, column_name: str, target_df: pd.DataFrame, num_samples: int) -> pd.Series:
    # print(type(row))
    target_df_start_index = target_df.shape[0]
    raw_img = row[column_name].astype('uint8')
    bin_image = convert_image_to_binary_image(raw_img)
    bin_3dimg = tf.expand_dims(input=bin_image, axis=2)
    bin_img_reshaped = tf.image.resize_with_pad(image=bin_3dimg, target_width=128, target_height=128, method="bilinear")

    for i in range(num_samples + 1):
        new_row = row.copy(deep=True)

        if i == 0:
            new_row[column_name] = np.squeeze(bin_img_reshaped, axis=2)
        else:
            aug_image = data_augmentation0(bin_img_reshaped)
            new_row[column_name] = np.squeeze(aug_image, axis=2)

        # display.display(new_row)
        target_df.loc[target_df_start_index + i] = new_row

    # print(target_df.shape)
    # display.display(target_df)
    return row

And I updated the call to apply as follows:

testDF = testDF.apply(augment, args=('binMap', tmp_df, 4, ), result_type='broadcast', axis=1)

Thank you @J_H. If there is a better way to achieve what I'm doing, please feel free to suggest improvements.
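
One possible alternative, sketched under assumptions rather than tested against the real pipeline: have the function return the new rows instead of writing into target_df, collect them in a list, and build the new dataframe once at the end. This avoids both the side effect and the repeated row-by-row growth of target_df via .loc. In the sketch below, augment_rows and the random-noise "augmentation" are hypothetical stand-ins for convert_image_to_binary_image / data_augmentation0, and the binarize-and-resize step is left out for brevity:

```python
import numpy as np
import pandas as pd


def augment_rows(row, column_name, num_samples, rng):
    """Return the original row plus num_samples augmented copies, as dicts."""
    base = row.to_dict()
    records = [base]
    for _ in range(num_samples):
        new_record = dict(base)
        # Hypothetical stand-in for the real image augmentation.
        noise = rng.normal(size=base[column_name].shape)
        new_record[column_name] = base[column_name] + noise
        records.append(new_record)
    return records


rng = np.random.default_rng(0)
test_df = pd.DataFrame({
    "label": ["x", "y"],
    "binMap": [np.zeros((2, 2)), np.ones((2, 2))],
})

all_records = []
for _, row in test_df.iterrows():
    all_records.extend(augment_rows(row, "binMap", 4, rng))

# Build the augmented dataframe in one go.
augmented_df = pd.DataFrame(all_records)
print(augmented_df.shape)
```

Since the function no longer mutates anything, it does not need .apply() at all; a plain loop over the rows states the intent more directly.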

Upvotes: 0
