DoubleDoverflow
DoubleDoverflow

Reputation: 21

Merging two excel files based on columns in dataframes VLookup style

My goal is to merge columns from a csv file containing either a username/ipaddress or an email/ipaddress combination after querying full username/email information. This seems like a basic inner join but is not working

example row in original file:
username     email                  ip address
NaN           [email protected]           69.0.1.127
Wilco         NaN                     69.0.12.128

example row from query output to csv file that needs to be combined:

username     email              branch

roger1       [email protected]      Los Angelas
Wilco        [email protected]     Montreal

Expected Output:

username       email              branch           ipaddress
roger1         [email protected]      Los Angelas      69.0.1.127
Wilco          [email protected]     Montreal         69.0.12.128

First part of this program is to gather data from the ip address excel sheet that has either username or email filled out, but not both... All of this is to combine username email and ipaddress at the end from this sheet and the sheet created.

directory variable = dirname

all_data = pd.DataFrame()

emailList=[]
userList = []
print "Dirname is " + dirname
#loop over excel sheets and create email and user strings to use for query
for f in glob.glob(dirname+"/Book1.xlsx"):
    df=pd.read_excel(f)
    all_data = all_data.append(df,ignore_index=True)

    all_data = all_data.append(df,ignore_index=True)
    emailList = all_data.dropna(subset=["email"])
    userList = all_data.dropna(subset=["user"])

    userList = userList["user"].tolist()
    emailList = emailList["email"].tolist()
    userList = map(int,userList)
    userList = ','.join(map(str,userList))
    emailList = "', '".join(map(str, emailList))

    emailList = "'" + emailList.upper() + "'"

all_data['email'] = all_data['email'].str.upper()

Query building logic here <>

Converted email to upper and other syntactic for an sql query for next step

con=cx_Oracle.connect("*************")
print "connection successful"
df_ora = pd.read_sql(queryStringEmail,con)
df_ora2 = pd.read_sql(queryStringUserList, con)
frames = [df_ora,df_ora2]
con.close()
newtable = pd.concat(frames)

This gives me a list of the users I need from the first excel sheet in the program.. all other rows can be ignored except the usernames/emails output in this dataframe.

nt = newtable.drop_duplicates(keep='last')

nt.to_csv("newcsv.csv", index=False, encoding='utf-8')
print "Operation successful"

Drop duplicates from second csv file as multiple of the same username/email/branch combos are useless

Lots of spaghetti code here, I apologize, but this is where the logic gets fuzzy

for f in glob.glob(dirname+"/newcsv.csv"):
    aa=pd.read_csv(f)
all_data.to_csv("newcsvALLDATA.csv", index=False, encoding='utf-8')
aa.to_csv("newcsvALLDATA2.csv", index=False, encoding='utf-8')

rename columns in the new csv file to match up with columns in original ip address csv (really just email and user)

aa.columns = ['user','email','first name','last name', 'branch', 'location']
print all_data
print aa

This is where I have tried all combinations and failed:

all_data = 

all_data.merge(aa,left_index=True,right_index=True,left_on="IP",how='inner')

print all_data
all_data.to_csv("newcsv2.csv", index=False, encoding='utf-8')
aa.to_excel(writer, sheet_name = 'x2')

I cannot print the data at the moment as I don't have access to db but can at a later point if someone has any idea

Again the goal is to output a csv file with the username/email combos matched to ipaddress csv that has either an email or username listed but not both

Upvotes: 1

Views: 1070

Answers (1)

DoubleDoverflow
DoubleDoverflow

Reputation: 21

Incase anyone runs into this issue in the future, I have solved it with the following code

aa.columns = ['user','email','first name','last name', 'provID', 'provName']
print aa
all_dataMerge = pd.merge(aa,all_data, on='user', suffixes=['1','2'])

The merge happens between both tables on 1 key but it handles both logically. Simple answer that took me forever to figure out. Hope this helps someone in the future.

   #Logic to combine the dataframes into final form, similar to VLookup
    data_final = all_dataMerge.drop_duplicates(keep='first')

Upvotes: 1

Related Questions