Comparing two dataframes to return a new dataframe using pandas - Python

Question

Need your help please.

I have two dataframes created from csvs and I need to return a new dataframe which will be the difference between the two on a specific field/column. For example, if ID from df1 is not in df2, then df3 should give me all columns and rows from df1 that are not in df2.

Note df1 and df2 columns are not identical i.e df1 could have more or less columns than df2 but the columns in df3 should be as of d1. Also, the ID (from df1) and User ID (df2) values are going to be the reconciling factor, the data in the fields will be the common factor but the actual field names are different.

Apologies in advance as the tables below are not clear. So in the below example, 1st row in df1 is not in df2, df3 should have this row. Once done, I need to save df3 as csv.

DF1

Direction ID Quantity Company Status

Sell - 09 - 32000 - T LTD - Rejected

Buy - 12 - 25000 - G Ltd - Done

Sell - 15 - 35000 - H Ltd - Done

DF2

Direction User ID Quantity Company Status Rating

Buy - 12 - 25000 - G Ltd - Done - Good Rating

Sell - 15 - 35000 - H Ltd - Done - Good Rating

Many thanks in advance

code so far:

import pandas as pd

fileLocationDF1 = "BBG.csv"
fileLocationDF2 = "corp.csv"

createDf1 = pd.read_csv(fileLocationDF1, low_memory = False)
createDf2 = pd.read_csv(fileLocationDF2, engine='python')

I have found the isin method which I think will help but the problem is that the "User ID" column (df2) has a space in the data frame (as is the case in the csv).

createDf1[createDf1.ID.isin(createDf2.columns[2].values)]

and I get the below error when

AttributeError: 'str' object has no attribute 'values'

I passed columns [2] in the isin method as the User ID has a space

Please help address the error and why the data is not being read so that i can get a unique set where the user Id from df2 is not in ID in df1.

See below - the one highligted is the one that is missing in DF2 and I would like this in df3

CParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.

Gerardo Reichl · Accepted Answer

I would do:

import pandas as pd

fileLocationDF1 = "BBG.csv"
fileLocationDF2 = "corp.csv"

createDf1 = pd.read_csv(fileLocationDF1, low_memory = False)
createDf2 = pd.read_csv(fileLocationDF2, engine='python')

# df3 will have createDf1 columns with ID's that are not in createDf2
# ~ means 'not' to the filter
# Acces the column via ['COLUMN NAME'] so you can put spaces into it ;)
df3 = createDf1[~createDf1['ID'].isin(createDf2['User ID'])]

I hope this helps!

Comparing two dataframes to return a new dataframe using pandas - Python

Direction ID Quantity Company Status

Direction User ID Quantity Company Status Rating

Answers (1)

Related Questions