Carola
Carola

Reputation: 366

Error when using update between two tables joined by a code

Be the following python pandas DataFrame:

ID country money code money_add other time
832932 Other NaN 00000 NaN [N2,N2,N4] 0 days 01:37:00
217#8# NaN NaN NaN NaN [N1,N2,N3] 2 days 01:01:00
1329T2 France 12131 00020 3452 [N1,N1] 1 days 03:55:00
124932 France NaN 00016 NaN [N2] 0 days 01:28:00
194022 France NaN 00000 NaN [N4,N3] 3 days 02:35:00

If code column is not NaN and the money column is NaN, we update the values money and money_add from the following table. Using the code and cod_t columns as a key.

cod_t money money_add
00000 4532 72323
00016 1213 23822
00030 1313 8393
00020 1813 27328

Example of the resulting table:

ID country money code money_add other time
832932 Other 4532 00000 72323 [N2,N2,N4] 0 days 01:37:00
217#8# NaN NaN NaN NaN [N1,N2,N3] 2 days 01:01:00
1329T2 France 12131 00020 3452 [N1,N1] 1 days 03:55:00
124932 France 1213 00016 23822 [N2] 0 days 01:28:00
194022 France 4532 00000 72323 [N4,N3] 3 days 02:35:00

User @jezrael, gave me the following solution to the problem:

df1 = df1.drop_duplicates('cod_t').set_index('cod_t')
df = df.set_index(df['code'])
df.update(df1, overwrite=False)
df = df.reset_index(drop=True).reindex(df.columns, axis=1)

But this code gives me an error that I don't know how to solve:

TypeError: The DType <class 'numpy.dtype[timedelta64]'> could not be promoted by <class
'numpy.dtype[float64]'>. This means that no common DType exists for the given inputs. 
For example they cannot be stored in a single array unless the dtype is `object`. 
The full list of DTypes is: (<class 'numpy.dtype[timedelta64]'>, <class 'numpy.dtype[float64]'>)
// First DataFrame dtypes
ID                                 object
country                            object
code                               object
money                             float64
money_add                         float64
other                              object
time                      timedelta64[ns]
dtype: object
// Second DataFrame dtypes
cod_t                     object
money                      int64
money_add                  int64
dtype: object

I would be grateful if you could help me to solve the error, or suggest an alternative method to using update.

Upvotes: 1

Views: 254

Answers (2)

user16836078
user16836078

Reputation:

This is another method different from Jezrael's, but you can try it out.

You can first create a condition variable for your dataframe.

condition = (df.code.isin(df1.cod_t) & ~df.code.isnull() & df.money.isna())
columns = ['money', 'money_add']

Next, use df.loc to do the update.

df.loc[condition, columns] = df1.loc[condition, columns]

       ID country    money  code  money_add       other            time
0  832932   Other   4532.0   0.0    72323.0  [N2,N2,N4] 0 days 01:37:00
1  217#8#     NaN      NaN   NaN        NaN  [N1,N2,N3] 2 days 01:01:00
2  1329T2  France  12131.0  20.0     3452.0     [N1,N1] 1 days 03:55:00
3  124932  France   1813.0  16.0    27328.0        [N2] 0 days 01:28:00
4  194022  France   8932.0   0.0     3204.0     [N4,N3] 3 days 02:35:00

Update

If there's an unequal length for both dataframe.

df1_cond = df1.cod_t.isin(df.loc[condition].code)

result = [i[1:] for row in df.loc[condition].code for i in df1.loc[df1_cond].values if row in i]

df.loc[condition, columns] = result

Upvotes: 2

jezrael
jezrael

Reputation: 863351

Because DataFrame.update not working well here is alternative - first use left join for new columns from second DataFrame by DataFrame.merge:

df2 = df.merge(df1.drop_duplicates('cod_t').rename(columns={'cod_t':'code'}), 
               on='code', 
               how='left',
               suffixes=('','_'))

print (df2)
       ID country    money  code  money_add         other            time  \
0  832932   Other      NaN   0.0        NaN  [N2, N2, N4] 0 days 01:37:00   
1  217#8#     NaN      NaN   NaN        NaN  [N1, N2, N3] 2 days 01:01:00   
2  1329T2  France  12131.0  20.0     3452.0      [N1, N1] 1 days 03:55:00   
3  124932  France      NaN  16.0        NaN          [N2] 0 days 01:28:00   
4  194022  France      NaN   0.0        NaN      [N4, N3] 3 days 02:35:00   

   money_  money_add_  
0  4532.0     72323.0  
1     NaN         NaN  
2  1813.0     27328.0  
3  1213.0     23822.0  
4  4532.0     72323.0 

Then get columns names with/without _:

cols_with_ = df2.columns[df2.columns.str.endswith('_')]
cols_without_ = cols_with_.str.rstrip('_')

print (cols_with_)
Index(['money_', 'money_add_'], dtype='object')

print (cols_without_)
Index(['money', 'money_add'], dtype='object')

Pass to DataFrame.combine_first and last remove helper columns:

df2[cols_without_] = (df2[cols_without_].combine_first(df2[cols_with_]
                                        .rename(columns=lambda x: x.rstrip('_'))))
df2 = df2.drop(cols_with_, axis=1)
print (df2)
       ID country    money  code  money_add         other            time
0  832932   Other   4532.0   0.0    72323.0  [N2, N2, N4] 0 days 01:37:00
1  217#8#     NaN      NaN   NaN        NaN  [N1, N2, N3] 2 days 01:01:00
2  1329T2  France  12131.0  20.0     3452.0      [N1, N1] 1 days 03:55:00
3  124932  France   1213.0  16.0    23822.0          [N2] 0 days 01:28:00
4  194022  France   4532.0   0.0    72323.0      [N4, N3] 3 days 02:35:00

Upvotes: 4

Related Questions