Reputation: 115
Need your help please.
I have two dataframes created from csvs and I need to return a new dataframe which will be the difference between the two on a specific field/column. For example, if ID from df1 is not in df2, then df3 should give me all columns and rows from df1 that are not in df2.
Note df1 and df2 columns are not identical i.e df1 could have more or less columns than df2 but the columns in df3 should be as of d1. Also, the ID (from df1) and User ID (df2) values are going to be the reconciling factor, the data in the fields will be the common factor but the actual field names are different.
Apologies in advance as the tables below are not clear. So in the below example, 1st row in df1 is not in df2, df3 should have this row. Once done, I need to save df3 as csv.
DF1
Sell - 09 - 32000 - T LTD - Rejected
Buy - 12 - 25000 - G Ltd - Done
Sell - 15 - 35000 - H Ltd - Done
DF2
Buy - 12 - 25000 - G Ltd - Done - Good Rating
Sell - 15 - 35000 - H Ltd - Done - Good Rating
Many thanks in advance
code so far:
import pandas as pd
fileLocationDF1 = "BBG.csv"
fileLocationDF2 = "corp.csv"
createDf1 = pd.read_csv(fileLocationDF1, low_memory = False)
createDf2 = pd.read_csv(fileLocationDF2, engine='python')
I have found the isin method which I think will help but the problem is that the "User ID" column (df2) has a space in the data frame (as is the case in the csv).
createDf1[createDf1.ID.isin(createDf2.columns[2].values)]
and I get the below error when
AttributeError: 'str' object has no attribute 'values'
I passed columns [2] in the isin method as the User ID has a space
Please help address the error and why the data is not being read so that i can get a unique set where the user Id from df2 is not in ID in df1.
See below - the one highligted is the one that is missing in DF2 and I would like this in df3
CParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.
Upvotes: 1
Views: 881
Reputation: 80
I would do:
import pandas as pd
fileLocationDF1 = "BBG.csv"
fileLocationDF2 = "corp.csv"
createDf1 = pd.read_csv(fileLocationDF1, low_memory = False)
createDf2 = pd.read_csv(fileLocationDF2, engine='python')
# df3 will have createDf1 columns with ID's that are not in createDf2
# ~ means 'not' to the filter
# Acces the column via ['COLUMN NAME'] so you can put spaces into it ;)
df3 = createDf1[~createDf1['ID'].isin(createDf2['User ID'])]
I hope this helps!
Upvotes: 2