Reputation: 366
Be the following python pandas DataFrame:
ID | country | money | code | money_add | other | time |
---|---|---|---|---|---|---|
832932 | Other | NaN | 00000 | NaN | [N2,N2,N4] | 0 days 01:37:00 |
217#8# | NaN | NaN | NaN | NaN | [N1,N2,N3] | 2 days 01:01:00 |
1329T2 | France | 12131 | 00020 | 3452 | [N1,N1] | 1 days 03:55:00 |
124932 | France | NaN | 00016 | NaN | [N2] | 0 days 01:28:00 |
194022 | France | NaN | 00000 | NaN | [N4,N3] | 3 days 02:35:00 |
If code
column is not NaN
and the money
column is NaN
, we update the values money
and money_add
from the following table. Using the code
and cod_t
columns as a key.
cod_t | money | money_add |
---|---|---|
00000 | 4532 | 72323 |
00016 | 1213 | 23822 |
00030 | 1313 | 8393 |
00020 | 1813 | 27328 |
Example of the resulting table:
ID | country | money | code | money_add | other | time |
---|---|---|---|---|---|---|
832932 | Other | 4532 | 00000 | 72323 | [N2,N2,N4] | 0 days 01:37:00 |
217#8# | NaN | NaN | NaN | NaN | [N1,N2,N3] | 2 days 01:01:00 |
1329T2 | France | 12131 | 00020 | 3452 | [N1,N1] | 1 days 03:55:00 |
124932 | France | 1213 | 00016 | 23822 | [N2] | 0 days 01:28:00 |
194022 | France | 4532 | 00000 | 72323 | [N4,N3] | 3 days 02:35:00 |
User @jezrael, gave me the following solution to the problem:
df1 = df1.drop_duplicates('cod_t').set_index('cod_t')
df = df.set_index(df['code'])
df.update(df1, overwrite=False)
df = df.reset_index(drop=True).reindex(df.columns, axis=1)
But this code gives me an error that I don't know how to solve:
TypeError: The DType <class 'numpy.dtype[timedelta64]'> could not be promoted by <class
'numpy.dtype[float64]'>. This means that no common DType exists for the given inputs.
For example they cannot be stored in a single array unless the dtype is `object`.
The full list of DTypes is: (<class 'numpy.dtype[timedelta64]'>, <class 'numpy.dtype[float64]'>)
// First DataFrame dtypes
ID object
country object
code object
money float64
money_add float64
other object
time timedelta64[ns]
dtype: object
// Second DataFrame dtypes
cod_t object
money int64
money_add int64
dtype: object
I would be grateful if you could help me to solve the error, or suggest an alternative method to using update
.
Upvotes: 1
Views: 254
Reputation:
This is another method different from Jezrael's, but you can try it out.
You can first create a condition variable for your dataframe.
condition = (df.code.isin(df1.cod_t) & ~df.code.isnull() & df.money.isna())
columns = ['money', 'money_add']
Next, use df.loc
to do the update.
df.loc[condition, columns] = df1.loc[condition, columns]
ID country money code money_add other time
0 832932 Other 4532.0 0.0 72323.0 [N2,N2,N4] 0 days 01:37:00
1 217#8# NaN NaN NaN NaN [N1,N2,N3] 2 days 01:01:00
2 1329T2 France 12131.0 20.0 3452.0 [N1,N1] 1 days 03:55:00
3 124932 France 1813.0 16.0 27328.0 [N2] 0 days 01:28:00
4 194022 France 8932.0 0.0 3204.0 [N4,N3] 3 days 02:35:00
If there's an unequal length for both dataframe.
df1_cond = df1.cod_t.isin(df.loc[condition].code)
result = [i[1:] for row in df.loc[condition].code for i in df1.loc[df1_cond].values if row in i]
df.loc[condition, columns] = result
Upvotes: 2
Reputation: 863351
Because DataFrame.update
not working well here is alternative - first use left join for new columns from second DataFrame by DataFrame.merge
:
df2 = df.merge(df1.drop_duplicates('cod_t').rename(columns={'cod_t':'code'}),
on='code',
how='left',
suffixes=('','_'))
print (df2)
ID country money code money_add other time \
0 832932 Other NaN 0.0 NaN [N2, N2, N4] 0 days 01:37:00
1 217#8# NaN NaN NaN NaN [N1, N2, N3] 2 days 01:01:00
2 1329T2 France 12131.0 20.0 3452.0 [N1, N1] 1 days 03:55:00
3 124932 France NaN 16.0 NaN [N2] 0 days 01:28:00
4 194022 France NaN 0.0 NaN [N4, N3] 3 days 02:35:00
money_ money_add_
0 4532.0 72323.0
1 NaN NaN
2 1813.0 27328.0
3 1213.0 23822.0
4 4532.0 72323.0
Then get columns names with/without _
:
cols_with_ = df2.columns[df2.columns.str.endswith('_')]
cols_without_ = cols_with_.str.rstrip('_')
print (cols_with_)
Index(['money_', 'money_add_'], dtype='object')
print (cols_without_)
Index(['money', 'money_add'], dtype='object')
Pass to DataFrame.combine_first
and last remove helper columns:
df2[cols_without_] = (df2[cols_without_].combine_first(df2[cols_with_]
.rename(columns=lambda x: x.rstrip('_'))))
df2 = df2.drop(cols_with_, axis=1)
print (df2)
ID country money code money_add other time
0 832932 Other 4532.0 0.0 72323.0 [N2, N2, N4] 0 days 01:37:00
1 217#8# NaN NaN NaN NaN [N1, N2, N3] 2 days 01:01:00
2 1329T2 France 12131.0 20.0 3452.0 [N1, N1] 1 days 03:55:00
3 124932 France 1213.0 16.0 23822.0 [N2] 0 days 01:28:00
4 194022 France 4532.0 0.0 72323.0 [N4, N3] 3 days 02:35:00
Upvotes: 4