user9510048
user9510048

Reputation:

Comparing Dataframe with different amount of rows and columns

I have a data frame A like this:

CARD_NUMBER TRANSACTION_AMOUNT TERMINAL_ID EXTERNAL_STAN TRANSACTION_DATE
604861*********3947 40000 Y6152114 686817 21/01/2021 09:53:41
440833******4229 200000 Y6152114 686821 21/01/2021 10:04:05
425322******8727 200000 Y6152114 686847 21/01/2021 12:06:25
406606******7268 300000 Y6152114 207606 21/01/2021 15:16:33
425322******5923 300000 Y6152114 683213 21/01/2021 16:25:04

And another data frame B which looks like this:

MICROFILM_REF_NUMBER LINKUP_CODE REVERSAL_FLAG MCC TRANSACTION_AMOUNT TRANSACTION_CURRENCY TRANSACTION_CURRENCY_EXP TRANSACTION_DATE ISSUER_ID ISSUER_ICA ACQUIRER_ICA CLEARING_BIN ACQUIRER_BANK_CODE MERCHANT_NUMBER MERCHANT_COUNTRY_CODE MERCHANT_CITY_NAME MERCHANT_ACRONYM OUTLET_NUMBER TERMINAL_ID MERCHANT_TRANSACTION_SIGN ISSUER_BANK_CODE CLIENT_CODE CARD_NUMBER EXPIRY_DATE TRANSACTION_SIGN BILLING_AMOUNT BILLING_CURRENCY_EXP NETWORK_CODE SETTLEMENT_AMOUNT SETTLEMENT_CURRENCY SETTLEMENT_CURRENCY_EXP FUNCTION_CODE AUTHORIZATION_CODE AUTHORIZATION_DATE ACTION_CODE REFERENCE_NUMBER INTERNAL_STAN EXTERNAL_STAN
70000021024100188324015 AUTH10401003 N 6011 509,47 978 2 26/01/2021 18:39 515880 12658 12658 000003 710 CALENDAR GIRLS BOYS JOHANNESBURG ZA 000045 000045 C ISS003 0360835902 5158800746440660 01/01/2021 D 509,47 2 02 509,47 978 2 205 407309 26/01/2021 17:55 000 407309407309 670676 407309
70000021024100188324320 AUTH10401003 N 6011 3038,85 978 2 26/01/2021 17:59 515880 12658 12658 000003 710 CALENDAR GIRLS BOYS JOHANNESBURG ZA 000045 000045 C ISS003 0348497473 5158800034404519 01/01/2021 D 3038,85 2 02 3038,85 978 2 205 763004 26/01/2021 17:16 000 763004763004 456406 763004
70000021024100188324064 AUTH10401003 N 6011 6762,29 978 2 26/01/2021 18:12 515880 12658 12658 000003 710 CALENDAR GIRLS BOYS JOHANNESBURG ZA 000045 000045 C ISS003 0390027716 5158800663287854 01/01/2021 D 6762,29 2 02 6762,29 978 2 205 001532 26/01/2021 17:29 000 001532001532 536033 001532
70000021024100188324130 AUTH10401003 N 6011 431,87 978 2 26/01/2021 17:35 515880 12658 12658 000003 710 CALENDAR GIRLS BOYS JOHANNESBURG ZA 000045 000045 C ISS003 0360236142 5158800688202193 01/01/2021 D 431,87 2 02 431,87 978 2 205 258257 26/01/2021 16:52 000 258257258257 286480 258257

I want to add a column 'Match' to data frame A so that one of the rows that exist from data frame A exits in data frame B then 'Match' is True, otherwise, it is False.

I also want to generate a file with the number of rows that matched and the number of rows that didn't.

Upvotes: 1

Views: 1961

Answers (1)

Azzedine
Azzedine

Reputation: 505

One possible solution to your problem is to use the merge function offer by Pandas. If you're familiar with SQL, this is just like a table join. The pd.merge adds the columns from df1 to df2 by checking where the two data frames match on the names of the columns. This can be accomplished using the following function :

df3=df1.merge(df2, on=['col1','col2','col3'], how='left', indicator='Match')

        

For example, if df1 was :

FIRST_NAME LAST_NAME SALARY
ahmed mobarki 20000.0
mourad melodi 96000.0
adam karimi 25000.0
fatima barkaoui 30000.0
david mahri 40500.0
said melloki 96400.0
ayoube sharfi 36420.0
john salali 97500.0
hajar mously 52400.0

and df2 was defined as:

enter image description here

Perform Left JOIN with an indicator.

df3 = df1.merge(df2, on=['FIRST_NAME','LAST_NAME','salary'], how='left', indicator='Match')

setting indicator=Match adds a column to the merged DataFame named as Match where the value of each row can be one of three possible values: left_only, right_only, or both.

we are interested when Match is equal both , because it means that the observations merge key is found in both DataFrames (we will then convert the both to True, or else it will be False).

df3['Match'] = df3['Match'] == 'both'

The df3 will look like this :

enter image description here

Now let's back to your dataframes A and B:

C=A.merge(B, on=['CARD_NUMBER','TRANSACTION_AMOUNT','TERMINAL_ID', 'EXTERNAL_STAN', 'TRANSACTION_DATE'], how='left', indicator='Match')

C['Match'] = C['Match'] == 'both'
# create a new column Match in dataframe A  and fill it from C
A["Match"] = C['Match']

Upvotes: 1

Related Questions