Reputation: 76
I am trying to add multiple columns to a dataframe with numpy.where() in an ETL logic.
This is my df:
I am trying to get my df as:
And the code is:
current_time = pd.Timestamp.utcnow().strftime('%Y-%m-%d %H:%M:%S')
df = pd.concat(
[
df,
pd.DataFrame(
[
np.where(
# When old hash code is available and new hash code is not available. 0 -- N
(
df['new_hash'].isna()
&
~df['old_hash'].isna()
) |
# When hash codes are available and matched. 3.1 -- 'N'
(
~df['new_hash'].isna()
&
~df['old_hash'].isna()
&
~(df['new_hash'].ne(df['old_hash']))
),
['N', df['cr_date'], df['up_date']],
np.where(
# When new hash code is available and old hash code is not available. 1 -- Y
(
~df['new_hash'].isna()
&
df['old_hash'].isna()
),
['Y', current_time, current_time],
np.where(
# When hash codes are available and matched. 3.2 -- 'Y'
(
~df['new_hash'].isna()
&
~df['old_hash'].isna()
&
df['new_hash'].ne(df['old_hash'])
),
['Y', df['cr_date'], current_time],
['N', df['cr_date'], df['up_date']]
)
)
)
],
index=df.index,
columns=['is_changed', 'cr_date_new', 'up_date_new']
)
],
axis=1
)
Tried above code with df.join()
instead of pd.concat()
. Still giving me below specified ValueError
I am able add one column at a time. and the example is:
df['is_changed'] = (
np.where(
# When old hash code is available and new hash code is not available. 0 -- N
(
df['new_hash'].isna()
&
~df['old_hash'].isna()
) |
# When hash codes are available and matched. 3.1 -- 'N'
(
~df['new_hash'].isna()
&
~df['old_hash'].isna()
&
~(df['new_hash'].ne(df['old_hash']))
),
'N',
np.where(
# When new hash code is available and old hash code is not available. 1 -- Y
(
~df['new_hash'].isna()
&
df['old_hash'].isna()
),
'Y',
np.where(
# When hash codes are available and matched. 3.2 -- 'Y'
(
~df['new_hash'].isna()
&
~df['old_hash'].isna()
&
df['new_hash'].ne(df['old_hash'])
),
'Y',
'N'
)
)
)
)
But getting error (ValueError: operands could not be broadcast together with shapes (66,) (3,) (3,)
) with multiple columns.
what is the wrong with adding multiple columns? Can someone help me in this?
Upvotes: 0
Views: 1214
Reputation: 231530
In np.where(cond,A,B)
Python evaluates each of cond
, A
and B
, and then passes them to the where
function. where
then broadcasts
the inputs against each other, and performs the element-wise selection. You appear to have 3 nested where
. I'm guessing the error occurs in the inner most one, since it will be evaluated first (I wouldn't have to guess if you provided the error traceback.)
np.where(
# When hash codes are available and matched. 3.2 -- 'Y'
(
~df['new_hash'].isna()
&
~df['old_hash'].isna()
&
df['new_hash'].ne(df['old_hash'])
),
['Y', df['cr_date'], current_time],
['N', df['cr_date'], df['up_date']]
)
The cond
part the first ()
logical_and expression.
The A
is the 3 element list, and B
the next list.
Assuming there are 66 rows, the cond
will have (66,) shape.
np.array(['Y', df['cr_date'], current_time])
is probably a (3,) shape object dtype array, since the inputs consist on a string, a Series, and a time object.
That accounts for the 3 shapes in the error message: shapes (66,) (3,) (3,))
If you try to set only one column at a time, the expression would be np.where(cond, 'Y', 'N')
, or np.where(cond, Series1, Series2)
.
If you don't understand what I (or the error) mean by broadcasting
, you may need to learn more about numpy
(which underlies pandas
).
Upvotes: 1