Reputation:
I have a data frame A like this:
CARD_NUMBER | TRANSACTION_AMOUNT | TERMINAL_ID | EXTERNAL_STAN | TRANSACTION_DATE |
---|---|---|---|---|
604861*********3947 | 40000 | Y6152114 | 686817 | 21/01/2021 09:53:41 |
440833******4229 | 200000 | Y6152114 | 686821 | 21/01/2021 10:04:05 |
425322******8727 | 200000 | Y6152114 | 686847 | 21/01/2021 12:06:25 |
406606******7268 | 300000 | Y6152114 | 207606 | 21/01/2021 15:16:33 |
425322******5923 | 300000 | Y6152114 | 683213 | 21/01/2021 16:25:04 |
And another data frame B which looks like this:
MICROFILM_REF_NUMBER | LINKUP_CODE | REVERSAL_FLAG | MCC | TRANSACTION_AMOUNT | TRANSACTION_CURRENCY | TRANSACTION_CURRENCY_EXP | TRANSACTION_DATE | ISSUER_ID | ISSUER_ICA | ACQUIRER_ICA | CLEARING_BIN | ACQUIRER_BANK_CODE | MERCHANT_NUMBER | MERCHANT_COUNTRY_CODE | MERCHANT_CITY_NAME | MERCHANT_ACRONYM | OUTLET_NUMBER | TERMINAL_ID | MERCHANT_TRANSACTION_SIGN | ISSUER_BANK_CODE | CLIENT_CODE | CARD_NUMBER | EXPIRY_DATE | TRANSACTION_SIGN | BILLING_AMOUNT | BILLING_CURRENCY_EXP | NETWORK_CODE | SETTLEMENT_AMOUNT | SETTLEMENT_CURRENCY | SETTLEMENT_CURRENCY_EXP | FUNCTION_CODE | AUTHORIZATION_CODE | AUTHORIZATION_DATE | ACTION_CODE | REFERENCE_NUMBER | INTERNAL_STAN | EXTERNAL_STAN |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
70000021024100188324015 | AUTH10401003 | N | 6011 | 509,47 | 978 | 2 | 26/01/2021 18:39 | 515880 | 12658 | 12658 | 000003 | 710 | CALENDAR GIRLS BOYS JOHANNESBURG ZA | 000045 | 000045 | C | ISS003 | 0360835902 | 5158800746440660 | 01/01/2021 | D | 509,47 | 2 | 02 | 509,47 | 978 | 2 | 205 | 407309 | 26/01/2021 17:55 | 000 | 407309407309 | 670676 | 407309 | |||
70000021024100188324320 | AUTH10401003 | N | 6011 | 3038,85 | 978 | 2 | 26/01/2021 17:59 | 515880 | 12658 | 12658 | 000003 | 710 | CALENDAR GIRLS BOYS JOHANNESBURG ZA | 000045 | 000045 | C | ISS003 | 0348497473 | 5158800034404519 | 01/01/2021 | D | 3038,85 | 2 | 02 | 3038,85 | 978 | 2 | 205 | 763004 | 26/01/2021 17:16 | 000 | 763004763004 | 456406 | 763004 | |||
70000021024100188324064 | AUTH10401003 | N | 6011 | 6762,29 | 978 | 2 | 26/01/2021 18:12 | 515880 | 12658 | 12658 | 000003 | 710 | CALENDAR GIRLS BOYS JOHANNESBURG ZA | 000045 | 000045 | C | ISS003 | 0390027716 | 5158800663287854 | 01/01/2021 | D | 6762,29 | 2 | 02 | 6762,29 | 978 | 2 | 205 | 001532 | 26/01/2021 17:29 | 000 | 001532001532 | 536033 | 001532 | |||
70000021024100188324130 | AUTH10401003 | N | 6011 | 431,87 | 978 | 2 | 26/01/2021 17:35 | 515880 | 12658 | 12658 | 000003 | 710 | CALENDAR GIRLS BOYS JOHANNESBURG ZA | 000045 | 000045 | C | ISS003 | 0360236142 | 5158800688202193 | 01/01/2021 | D | 431,87 | 2 | 02 | 431,87 | 978 | 2 | 205 | 258257 | 26/01/2021 16:52 | 000 | 258257258257 | 286480 | 258257 |
I want to add a column 'Match' to data frame A so that one of the rows that exist from data frame A exits in data frame B then 'Match' is True, otherwise, it is False.
I also want to generate a file with the number of rows that matched and the number of rows that didn't.
Upvotes: 1
Views: 1961
Reputation: 505
One possible solution to your problem is to use the merge function offer by Pandas. If you're familiar with SQL, this is just like a table join. The pd.merge
adds the columns from df1 to df2 by checking where the two data frames match on the names of the columns. This can be accomplished using the following function :
df3=df1.merge(df2, on=['col1','col2','col3'], how='left', indicator='Match')
For example, if df1 was :
FIRST_NAME | LAST_NAME | SALARY |
---|---|---|
ahmed | mobarki | 20000.0 |
mourad | melodi | 96000.0 |
adam | karimi | 25000.0 |
fatima | barkaoui | 30000.0 |
david | mahri | 40500.0 |
said | melloki | 96400.0 |
ayoube | sharfi | 36420.0 |
john | salali | 97500.0 |
hajar | mously | 52400.0 |
and df2 was defined as:
Perform Left JOIN with an indicator.
df3 = df1.merge(df2, on=['FIRST_NAME','LAST_NAME','salary'], how='left', indicator='Match')
setting indicator=Match
adds a column to the merged DataFame named as Match
where the value of each row can be one of three possible values: left_only, right_only, or both
.
we are interested when Match is equal both
, because it means that the observations merge key is found in both DataFrames (we will then convert the both
to True
, or else it will be False
).
df3['Match'] = df3['Match'] == 'both'
The df3
will look like this :
Now let's back to your dataframes A and B:
C=A.merge(B, on=['CARD_NUMBER','TRANSACTION_AMOUNT','TERMINAL_ID', 'EXTERNAL_STAN', 'TRANSACTION_DATE'], how='left', indicator='Match')
C['Match'] = C['Match'] == 'both'
# create a new column Match in dataframe A and fill it from C
A["Match"] = C['Match']
Upvotes: 1