Reputation: 9233
I have a MasterList
dataframe
that I merge other data sets into via a loop. Each time I merge a new column is created as with _x
or _y
. How can I just keep these as one column?
import pandas as pd
MasterList = pd.DataFrame(data = [['0001'],['0002'], ['0003'], ['0004']], columns = ['Order Number'])
customer_file1 = pd.DataFrame(data = [['0003', 'M'], ['0004', 'W']], columns = ['Order Number', 'Day'])
customer_file2 = pd.DataFrame(data = [['0001', 'T'], ['0002', 'S']], columns = ['Order Number', 'Day'])
for x in [customer_file1, customer_file2]:
MasterList = pd.merge(MasterList, x, how='left',left_on= 'Order Number',right_on= 'Order Number')
print MasterList
Output:
Order Number Day_x Day_y
0 0001 NaN T
1 0002 NaN S
2 0003 M NaN
3 0004 W NaN
Desired Output:
Order Number Day
0 0001 T
1 0002 S
2 0003 M
3 0004 W
Edit: People wanted more data as I over simplified my example: I know year and day don't really make sense in the data set buy that is ok. Each customer file really come from a query of a different database, so I'd like to make the query from the database then merge in the data and forget about it rather than querying all the customer databases, concatenating, then merging.
import pandas as pd
MasterList = pd.DataFrame(data = [['0001', '2015'],['0002', '2015'], ['0003', '2016'], ['0004', '2015'], ['0005', '2017'], ['0006', '2018']], columns = ['Order Number', 'Year'])
customer_file1 = pd.DataFrame(data = [['0003', 'M'], ['0004', 'W']], columns = ['Order Number', 'Day'])
customer_file2 = pd.DataFrame(data = [['0001', 'T'], ['0002', 'S']], columns = ['Order Number', 'Day'])
customer_file3 = pd.DataFrame(data = [['0005', 'T'], ['0006', 'S']], columns = ['Order Number', 'Day'])
for x in [customer_file1, customer_file2, customer_file3]:
MasterList = pd.merge(MasterList, x, how='left', left_on='Order Number', right_on='Order Number')
print MasterList
output:
Order Number Year Day_x Day_y Day
0 0001 2015 NaN T NaN
1 0002 2015 NaN S NaN
2 0003 2016 M NaN NaN
3 0004 2015 W NaN NaN
4 0005 2017 NaN NaN T
5 0006 2018 NaN NaN S
Desired output:
Order Number Year Day
0 0001 2015 T
1 0002 2015 S
2 0003 2016 M
3 0004 2015 W
4 0005 2017 T
5 0006 2018 S
Upvotes: 0
Views: 399
Reputation: 323306
Base on your out put , just do following , you get your Desired output.
df.apply(lambda x: sorted(x, key=pd.isnull), 1).dropna(1)
Out[126]:
Order Number Year Day_x
0 0 1 2015 T
1 1 2 2015 S
2 2 3 2016 M
3 3 4 2015 W
4 4 5 2017 T
5 5 6 2018 S
Upvotes: 0
Reputation: 76927
Option 1] Use map
and combine_first
In [5044]: MasterList['Day'] = np.nan
...: for x in [customer_file1, customer_file2]:
...: day = MasterList['Order Number'].map(x.set_index('Order Number')['Day'])
...: MasterList['Day'] = MasterList['Day'].combine_first(day)
...:
In [5045]: MasterList
Out[5045]:
Order Number Day
0 0001 T
1 0002 S
2 0003 M
3 0004 W
Options 2]
Use merge
and append
In [5032]: MasterList.merge(customer_file1.append(customer_file2))
Out[5032]:
Order Number Day
0 0001 T
1 0002 S
2 0003 M
3 0004 W
Or use merge
and concat
In [5033]: MasterList.merge(pd.concat([customer_file1, customer_file2]))
Out[5033]:
Order Number Day
0 0001 T
1 0002 S
2 0003 M
3 0004 W
Upvotes: 2
Reputation: 30605
Common mistake of merge use concat instead i.e
MasterList = pd.concat([customer_file2,customer_file1],ignore_index=True)
Order Number Day 0 0001 T 1 0002 S 2 0003 M 3 0004 W
Upvotes: 2