Praveen Kumar B
Praveen Kumar B

Reputation: 76

Adding multiple columns to pandas dataframe with np.where clause

I am trying to add multiple columns to a dataframe with numpy.where() in an ETL logic.

This is my df:

enter image description here

I am trying to get my df as:

enter image description here

And the code is:

current_time = pd.Timestamp.utcnow().strftime('%Y-%m-%d %H:%M:%S')

df = pd.concat(
    [
        df,
        pd.DataFrame(
            [
                np.where(
                    # When old hash code is available and new hash code is not available. 0 -- N
                    (
                            df['new_hash'].isna()
                            &
                            ~df['old_hash'].isna()
                    ) |
                    # When hash codes are available and matched. 3.1 -- 'N'
                    (
                            ~df['new_hash'].isna()
                            &
                            ~df['old_hash'].isna()
                            &
                            ~(df['new_hash'].ne(df['old_hash']))
                    ),
                    ['N', df['cr_date'], df['up_date']],
                    np.where(
                        # When new hash code is available and old hash code is not available. 1 -- Y
                        (
                                ~df['new_hash'].isna()
                                &
                                df['old_hash'].isna()
                        ),
                        ['Y', current_time, current_time],
                        np.where(
                            # When hash codes are available and matched. 3.2 -- 'Y'
                            (
                                    ~df['new_hash'].isna()
                                    &
                                    ~df['old_hash'].isna()
                                    &
                                    df['new_hash'].ne(df['old_hash'])
                            ),
                            ['Y', df['cr_date'], current_time],
                            ['N', df['cr_date'], df['up_date']]
                        )
                    )
                )
            ],
            index=df.index,
            columns=['is_changed', 'cr_date_new', 'up_date_new']
        )
    ],
    axis=1
)

Tried above code with df.join() instead of pd.concat(). Still giving me below specified ValueError

I am able add one column at a time. and the example is:

df['is_changed'] = (
    np.where(
        # When old hash code is available and new hash code is not available. 0 -- N
        (
                df['new_hash'].isna()
                &
                ~df['old_hash'].isna()
        ) |
        # When hash codes are available and matched. 3.1 -- 'N'
        (
                ~df['new_hash'].isna()
                &
                ~df['old_hash'].isna()
                &
                ~(df['new_hash'].ne(df['old_hash']))
        ),
        'N',
        np.where(
            # When new hash code is available and old hash code is not available. 1 -- Y
            (
                    ~df['new_hash'].isna()
                    &
                    df['old_hash'].isna()
            ),
            'Y',
            np.where(
                # When hash codes are available and matched. 3.2 -- 'Y'
                (
                        ~df['new_hash'].isna()
                        &
                        ~df['old_hash'].isna()
                        &
                        df['new_hash'].ne(df['old_hash'])
                ),
                'Y',
                'N'
            )
        )
    )
)

But getting error (ValueError: operands could not be broadcast together with shapes (66,) (3,) (3,)) with multiple columns.

what is the wrong with adding multiple columns? Can someone help me in this?

Upvotes: 0

Views: 1214

Answers (1)

hpaulj
hpaulj

Reputation: 231530

In np.where(cond,A,B) Python evaluates each of cond, A and B, and then passes them to the where function. where then broadcasts the inputs against each other, and performs the element-wise selection. You appear to have 3 nested where. I'm guessing the error occurs in the inner most one, since it will be evaluated first (I wouldn't have to guess if you provided the error traceback.)

                    np.where(
                        # When hash codes are available and matched. 3.2 -- 'Y'
                        (
                                ~df['new_hash'].isna()
                                &
                                ~df['old_hash'].isna()
                                &
                                df['new_hash'].ne(df['old_hash'])
                        ),
                        ['Y', df['cr_date'], current_time],
                        ['N', df['cr_date'], df['up_date']]
                    )

The cond part the first () logical_and expression.

The A is the 3 element list, and B the next list.

Assuming there are 66 rows, the cond will have (66,) shape.

np.array(['Y', df['cr_date'], current_time]) is probably a (3,) shape object dtype array, since the inputs consist on a string, a Series, and a time object.

That accounts for the 3 shapes in the error message: shapes (66,) (3,) (3,))

If you try to set only one column at a time, the expression would be np.where(cond, 'Y', 'N'), or np.where(cond, Series1, Series2).

If you don't understand what I (or the error) mean by broadcasting, you may need to learn more about numpy (which underlies pandas).

Upvotes: 1

Related Questions