iamniki
iamniki

Reputation: 561

Merging two dataframes with conditions

So, basically I have two dataframes one is df1 and df2 so each size is [9000000 rows X 11 columns] and [2700000 rows X 18 columns].

Following df1 looks like this,

| Index | Frame | Chunk_Ids |...|P_filter|
|:------|:-----:|:---------:|:-:|:------:|
| 1     | 1     | 20001     |...|   0.142|
| 2     | 1     | 20001     |...|   0.146|
| 3     | 1     | 20001     |...|   0.147|
| .     | .     | .         |...|       .|
| 900000| 260000   | 200020 |0.245| 0.142|

Following `df2` looks like this, 
| Index | Frame  | Chunk_Ids |...| P_filter|
|:------|:------:|:---------:|:-:|:-------:|
| 3     | 1      |    19999  |...|    0.142|
| 6     | 1      |     20001 |...|    0.146|
| 7     | 1      |     20001 |...|    0.147|
| .     | .      | .         |...|        .|
| 270000| 260000 | 200019    |...|    0.142|

I have two questions, how to merge two dataframes with below condition,

  1. Merge if both df contain common columns chunk IDs without missing any data
  2. Compare df1 index with df2 index, and if the particular index is missing then replace whole respective df2 row with '-1'.

I tried and burnt my head out from last two days with many methods but I can't satisify both conditions. Any leads will be appreciated.

Upvotes: 0

Views: 849

Answers (1)

Patryk Kowalski
Patryk Kowalski

Reputation: 571

I didn't test because you did not provide usable examples - but something like this should work

df_merged = df1.merge(df2, on='Chunk_Ids', suffixes=['_df1', '_df2']
df_merged['index_missing'] = df_merged[['Index_df1', 'Index_df2']].apply(lambda x: sum(pd.isnull(x))) > 0
df2_columns = [str(x) + '_df2' for x in df2.columns.tolist()]
df_merged.loc[df_merged.index_missing, df2_columns] = -1

I'm not entirely sure what you meant by not missing any data but you might prefer to use outer join .merge(how='outer'). By default inner join is performed, meaning rows without pair will be lost.

Upvotes: 2

Related Questions