Reputation: 2363
I have several dataframes and need to merge them on the date column. With only two dataframes I could use df1.merge(df2, on='date'); with three I use df1.merge(df2.merge(df3, on='date'), on='date'), but this becomes complex and unreadable as the number of dataframes grows.
All the dataframes have one column in common, date, but they don't have the same number of rows or columns, and I only need the rows whose date appears in every dataframe.
So I tried to write a recursive function that returns a dataframe with all the data, but it didn't work. How should I merge multiple dataframes?
I tried different approaches and got errors such as out of range, KeyError 0/1/2/3 and can not merge DataFrame with instance of type <class 'NoneType'>.
This is the script I wrote:
dfs = [df1, df2, df3]  # list of dataframes

def mergefiles(dfs, countfiles, i=0):
    if i == (countfiles - 2):  # it gets to the second to last and merges it with the last
        return
    dfm = dfs[i].merge(mergefiles(dfs[i+1], countfiles, i=i+1), on='date')
    return dfm

print(mergefiles(dfs, len(dfs)))
An example: df_1:
May 19, 2017;1,200.00;0.1%
May 18, 2017;1,100.00;0.1%
May 17, 2017;1,000.00;0.1%
May 15, 2017;1,901.00;0.1%
df_2:
May 20, 2017;2,200.00;1000000;0.2%
May 18, 2017;2,100.00;1590000;0.2%
May 16, 2017;2,000.00;1230000;0.2%
May 15, 2017;2,902.00;1000000;0.2%
df_3:
May 21, 2017;3,200.00;2000000;0.3%
May 17, 2017;3,100.00;2590000;0.3%
May 16, 2017;3,000.00;2230000;0.3%
May 15, 2017;3,903.00;2000000;0.3%
Expected merge result:
May 15, 2017; 1,901.00;0.1%; 2,902.00;1000000;0.2%; 3,903.00;2000000;0.3%
Upvotes: 200
Views: 520956
Reputation: 7255
df_merged = reduce(lambda left,right: pd.merge(left,right,on=['DATE'],
how='outer'), data_frames)
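Below is the cleanest, most comprehensible way of merging multiple dataframes when no complex queries are involved.
Simply merge on the DATE column using the OUTER method (to keep all the data). Note that how='outer' keeps every date found in any dataframe; to keep only the dates common to all dataframes, as the question asks, use how='inner' instead.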
import pandas as pd
from functools import reduce
df1 = pd.read_csv('file1.csv')
df2 = pd.read_csv('file2.csv')
df3 = pd.read_csv('file3.csv')
Now load all the dataframes into a list, then merge them using merge together with the reduce function.
# compile the list of dataframes you want to merge
data_frames = [df1, df2, df3]
Note: you can add as many dataframes to the above list as you like. This is the good part about this method: no complex queries involved.
To keep the values that belong to the same date together, merge on the DATE column:
df_merged = reduce(lambda left, right: pd.merge(left, right, on=['DATE'],
                                                how='outer'), data_frames)
# to fill the values missing from the merged dataframe, chain fillna with the required string
df_merged = reduce(lambda left, right: pd.merge(left, right, on=['DATE'],
                                                how='outer'), data_frames).fillna('void')
Then write the merged data to a CSV file if desired.
df_merged.to_csv('merged.txt', sep=',', na_rep='.', index=False)
This should give you
DATE VALUE1 VALUE2 VALUE3 ....
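For the question's requirement (only the dates present in every dataframe), the same reduce pattern can be run with how='inner'. A minimal sketch, assuming small in-memory frames built from the question's sample; the VALUE1..VALUE3 column names and the merge_all helper are made up for illustration:

```python
from functools import reduce

import pandas as pd

# Hypothetical frames based on the question's sample data
df1 = pd.DataFrame({"DATE": ["May 19, 2017", "May 18, 2017", "May 17, 2017", "May 15, 2017"],
                    "VALUE1": [1200.0, 1100.0, 1000.0, 1901.0]})
df2 = pd.DataFrame({"DATE": ["May 20, 2017", "May 18, 2017", "May 16, 2017", "May 15, 2017"],
                    "VALUE2": [2200.0, 2100.0, 2000.0, 2902.0]})
df3 = pd.DataFrame({"DATE": ["May 21, 2017", "May 17, 2017", "May 16, 2017", "May 15, 2017"],
                    "VALUE3": [3200.0, 3100.0, 3000.0, 3903.0]})
data_frames = [df1, df2, df3]


def merge_all(frames, how):
    """Fold pd.merge over the list, joining on DATE with the given method."""
    return reduce(lambda left, right: pd.merge(left, right, on="DATE", how=how), frames)


outer = merge_all(data_frames, "outer")  # every date seen in any frame, NaN where missing
inner = merge_all(data_frames, "inner")  # only dates present in all frames

print(len(outer))  # 7 distinct dates
print(inner)       # a single row for May 15, 2017
```

With how='inner' no fillna step is needed, because every surviving row has a value from each dataframe.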
Upvotes: 338
Reputation: 1770
You could also use dataframe.merge like this
df = df1.merge(df2).merge(df3)
UPDATE
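Comparing the performance of this method with the currently accepted answer (note that the accepted answer uses how='outer', while the chained merge timed below uses the default how='inner', so the statements do not produce the same result):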
import timeit
setup = '''import pandas as pd
from functools import reduce
df_1 = pd.DataFrame({'date': {0: 'May 19, 2017', 1: 'May 18, 2017', 2: 'May 17, 2017', 3: 'May 15, 2017'}, 'a': {0: '1,200.00', 1: '1,100.00', 2: '1,000.00', 3: '1,901.00'}, 'b': {0: '0.1%', 1: '0.1%', 2: '0.1%', 3: '0.1%'}})
df_2 = pd.DataFrame({'date': {0: 'May 20, 2017', 1: 'May 18, 2017', 2: 'May 16, 2017', 3: 'May 15, 2017'}, 'a': {0: '2,200.00', 1: '2,100.00', 2: '2,000.00', 3: '2,902.00'}, 'b': {0: 1000000, 1: 1590000, 2: 1230000, 3: 1000000}, 'c': {0: '0.2%', 1: '0.2%', 2: '0.2%', 3: '0.2%'}})
df_3 = pd.DataFrame({'date': {0: 'May 21, 2017', 1: 'May 17, 2017', 2: 'May 16, 2017', 3: 'May 15, 2017'}, 'a': {0: '3,200.00', 1: '3,100.00', 2: '3,000.00', 3: '3,903.00'}, 'b': {0: 2000000, 1: 2590000, 2: 2230000, 3: 2000000}, 'c': {0: '0.3%', 1: '0.3%', 2: '0.3%', 3: '0.3%'}})
dfs = [df_1, df_2, df_3]'''
#methods from currently accepted answer
>>> timeit.timeit(setup=setup, stmt="reduce(lambda left,right: pd.merge(left,right,on=['date'], how='outer'), dfs)", number=1000)
3.3471919000148773
>>> timeit.timeit(setup=setup, stmt="df_merged = reduce(lambda left,right: pd.merge(left,right,on=['date'], how='outer'), dfs).fillna('void')", number=1000)
4.079146400094032
#method demonstrated in this answer
>>> timeit.timeit(setup=setup, stmt="df = df_1.merge(df_2, on='date').merge(df_3, on='date')", number=1000)
2.7787032001651824
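One caveat about the one-liner at the top of this answer: when on is omitted, merge joins on every column name the two frames share, not just the date. A small sketch of the difference, using two made-up frames that both have a column called a:

```python
import pandas as pd

# Two hypothetical frames sharing the columns "date" and "a"
left = pd.DataFrame({"date": ["May 15, 2017", "May 18, 2017"], "a": [1901.0, 1100.0]})
right = pd.DataFrame({"date": ["May 15, 2017", "May 18, 2017"], "a": [2902.0, 2100.0]})

# No "on": the join key is (date, a), and no row has both values equal
implicit = left.merge(right)

# Explicit key: rows match on date only, the clashing "a" columns get suffixes
explicit = left.merge(right, on="date")

print(implicit.empty)          # True
print(list(explicit.columns))  # ['date', 'a_x', 'a_y']
```

So with frames like the ones in the question, where value columns share names, passing on='date' (as in the timed statement) is the safer form.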
Upvotes: 9
Reputation: 61
I had a similar use case and solved it with the code below. Basically, I captured the first df in the list, then looped through the remainder and merged each one in turn, with the result of each merge replacing the previous one.
Edit: I was dealing with pretty small dataframes, so I'm unsure how this approach would scale to larger datasets. #caveatemptor
import pandas as pd
df_list = [df1, df2, df3]  # ... and so on, up to dfn

# grab first dataframe
all_merged = df_list[0]

# loop through all but first data frame
for to_merge in df_list[1:]:
    # result of merge replaces first or previously
    # merged data frame w/ all previous fields
    all_merged = pd.merge(
        left=all_merged,
        right=to_merge,
        how='inner',
        on=['some_fld_across_all'],
    )

# can easily have this logic live in a function
def merge_mult_dfs(df_list):
    all_merged = df_list[0]
    for to_merge in df_list[1:]:
        all_merged = pd.merge(
            left=all_merged,
            right=to_merge,
            how='inner',
            on=['some_fld_across_all'],
        )
    return all_merged
Upvotes: 1
Reputation: 877
Another way to combine: functools.reduce
From documentation:
For example,
reduce(lambda x, y: x+y, [1, 2, 3, 4, 5])
calculates ((((1+2)+3)+4)+5). The left argument, x, is the accumulated value and the right argument, y, is the update value from the iterable.
So:
from functools import reduce
import pandas as pd
dfs = [df1, df2, df3, df4, df5, df6]
df_final = reduce(lambda left, right: pd.merge(left, right, on='some_common_column_name'), dfs)
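Since reduce folds from left to right, for three frames it should give the same result as chaining the merges by hand. A quick sketch to check this, with made-up frames and a made-up key column:

```python
from functools import reduce

import pandas as pd

# Hypothetical frames with one shared column, "key"
df1 = pd.DataFrame({"key": [1, 2, 3], "x": [10, 20, 30]})
df2 = pd.DataFrame({"key": [2, 3, 4], "y": [200, 300, 400]})
df3 = pd.DataFrame({"key": [3, 4, 5], "z": [3000, 4000, 5000]})

# reduce computes merge(merge(df1, df2), df3)
folded = reduce(lambda left, right: pd.merge(left, right, on="key"), [df1, df2, df3])

# The same thing written out as a chain
chained = df1.merge(df2, on="key").merge(df3, on="key")

# Raises AssertionError if the two results differ
pd.testing.assert_frame_equal(folded, chained)
print(folded)  # one row: key=3, x=30, y=300, z=3000
```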
Upvotes: 19
Reputation: 1
For me the index is ignored without explicit instruction. Example:
> x = pandas.DataFrame({'a': [1,2,2], 'b':[4,5,5]})
> x
   a  b
0  1  4
1  2  5
2  2  5
> x.drop_duplicates()
   a  b
0  1  4
1  2  5
(duplicated lines removed despite different index)
Upvotes: 0
Reputation: 627
@everestial007's solution worked for me. This is how I adapted it for my use case: each df's columns get a different suffix, so I can more easily tell the dfs apart in the final merged dataframe.
from functools import reduce
import pandas as pd
dfs = [df1, df2, df3, df4]
suffixes = [f"_{i}" for i in range(len(dfs))]
# add suffixes to each df
dfs = [dfs[i].add_suffix(suffixes[i]) for i in range(len(dfs))]
# remove suffix from the merging column
dfs = [dfs[i].rename(columns={f"date{suffixes[i]}":"date"}) for i in range(len(dfs))]
# merge
dfs = reduce(lambda left,right: pd.merge(left,right,how='outer', on='date'), dfs)
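To see what the suffixing produces, here is a small sketch on made-up data: four frames that all have a date column and a value column named a. The frame contents are assumptions for illustration only.

```python
from functools import reduce

import pandas as pd

# Four hypothetical frames with identical column names
frames = [
    pd.DataFrame({"date": ["May 15, 2017", "May 18, 2017"], "a": [i, i + 10]})
    for i in range(4)
]

# Suffix every column with the frame's position, then restore the key's name
suffixed = [
    df.add_suffix(f"_{i}").rename(columns={f"date_{i}": "date"})
    for i, df in enumerate(frames)
]

merged = reduce(lambda left, right: pd.merge(left, right, how="outer", on="date"), suffixed)
print(list(merged.columns))  # ['date', 'a_0', 'a_1', 'a_2', 'a_3']
```

Because every value column is unique before merging, pandas never has to fall back on its default _x/_y suffixes.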
Upvotes: 1
Reputation: 751
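functools.reduce and pd.concat are both good solutions, but in terms of execution time pd.concat is the best. Both snippets below align the dataframes on their index, so the key column (e.g. the date) must be set as the index first.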
from functools import reduce
import pandas as pd
dfs = [df1, df2, df3, ...]
nan_value = 0
# solution 1 (fast)
result_1 = pd.concat(dfs, join='outer', axis=1).fillna(nan_value)
# solution 2
result_2 = reduce(lambda df_left, df_right: pd.merge(df_left, df_right,
                                                     left_index=True, right_index=True,
                                                     how='outer'),
                  dfs).fillna(nan_value)
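Keep in mind that pd.concat with axis=1 aligns rows by index label, not by a date column. A minimal sketch of the difference, assuming two made-up frames that still have the default integer index:

```python
import pandas as pd

# Hypothetical frames with a default RangeIndex (0, 1)
df1 = pd.DataFrame({"date": ["May 19, 2017", "May 15, 2017"], "v1": [1200.0, 1901.0]})
df2 = pd.DataFrame({"date": ["May 15, 2017", "May 20, 2017"], "v2": [2902.0, 2200.0]})

# Aligned on 0, 1: rows are paired by position, so unrelated dates sit side by side
by_position = pd.concat([df1, df2], axis=1)

# Aligned on the date: set it as the index first, keep only shared dates
by_date = pd.concat([df.set_index("date") for df in (df1, df2)], axis=1, join="inner")

print(by_position.shape)  # (2, 4), with two separate "date" columns
print(by_date)            # one row, indexed by May 15, 2017
```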
Upvotes: 44
Reputation: 111
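See the related question "pandas three-way joining multiple dataframes on columns":
filenames = ['fn1', 'fn2', 'fn3', 'fn4']  # ... and so on
# index_col must be defined beforehand: the column to use as the index (join key)
dfs = [pd.read_csv(filename, index_col=index_col) for filename in filenames]
dfs[0].join(dfs[1:])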
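DataFrame.join defaults to how='left', so the call above keeps every row of the first dataframe. A sketch of the inner variant, assuming in-memory frames instead of files, with made-up column names v1..v3 that do not overlap:

```python
import pandas as pd

# Hypothetical frames, each indexed by date, with distinct value columns
df1 = pd.DataFrame({"date": ["May 19, 2017", "May 15, 2017"], "v1": [1200.0, 1901.0]}).set_index("date")
df2 = pd.DataFrame({"date": ["May 18, 2017", "May 15, 2017"], "v2": [2100.0, 2902.0]}).set_index("date")
df3 = pd.DataFrame({"date": ["May 15, 2017", "May 21, 2017"], "v3": [3903.0, 3200.0]}).set_index("date")
dfs = [df1, df2, df3]

# Default (left) join: all dates of dfs[0], NaN where the others have no match
left_joined = dfs[0].join(dfs[1:])

# Inner join: only the dates found in every frame
common = dfs[0].join(dfs[1:], how="inner")

print(len(left_joined))  # 2
print(common)            # one row for May 15, 2017
```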
Upvotes: 5
Reputation: 2363
Thank you for your help @jezrael, @zipa and @everestial007, both answers are what I need. If I wanted a recursive version, this would also work as intended:
def mergefiles(dfs, on):
    """Recursively merge a list of dataframes on one column."""
    if len(dfs) == 1:
        # Nothing left to merge
        return dfs[0]
    # Merge the first and second dataframes into a new dataframe
    df = dfs[0].merge(dfs[1], on=on)
    # Build a new list: the merged dataframe followed by the remaining ones
    dfl = [df] + dfs[2:]
    return mergefiles(dfl, on)
Upvotes: 1
Reputation: 2502
@dannyeuu's answer is correct. pd.concat naturally does a join on index columns, if you set the axis option to 1. The default is an outer join, but you can specify inner join too. Here is an example:
x = pd.DataFrame({'a': [2,4,3,4,5,2,3,4,2,5], 'b':[2,3,4,1,6,6,5,2,4,2], 'val': [1,4,4,3,6,4,3,6,5,7], 'val2': [2,4,1,6,4,2,8,6,3,9]})
x.set_index(['a','b'], inplace=True)
x.sort_index(inplace=True)
y = x.copy()
y.loc[(14,14),:] = [3,1]
y['other']=range(0,11)
y.sort_values('val', inplace=True)
z = x.copy()
z.loc[(15,15),:] = [3,4]
z['another']=range(0,22,2)
z.sort_values('val2',inplace=True)
pd.concat([x,y,z],axis=1)
Upvotes: 5
Reputation: 864
Looks like the data has the same columns, so you can:
df1 = pd.DataFrame(data1)
df2 = pd.DataFrame(data2)
merged_df = pd.concat([df1, df2])
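Note that pd.concat with its default axis=0 stacks the rows of the frames on top of each other; it does not match rows by date. A small sketch with made-up data showing what comes out:

```python
import pandas as pd

# Two hypothetical frames with identical columns
df1 = pd.DataFrame({"date": ["May 15, 2017", "May 18, 2017"], "value": [1901.0, 1100.0]})
df2 = pd.DataFrame({"date": ["May 15, 2017", "May 20, 2017"], "value": [2902.0, 2200.0]})

# Rows of df2 are appended below the rows of df1
stacked = pd.concat([df1, df2], ignore_index=True)

print(len(stacked))  # 4
print((stacked["date"] == "May 15, 2017").sum())  # 2, the shared date appears twice
```

This fits when the frames are chunks of the same table; for one row per common date, a merge on the date column is the tool.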
Upvotes: 64
Reputation: 27869
If you are filtering by common date this will return it:
dfs = [df1, df2, df3]
checker = dfs[-1]
# assumes the date is in the column labelled 0
check = set(checker.loc[:, 0])

for df in dfs[:-1]:
    check = check.intersection(set(df.loc[:, 0]))

print(checker[checker.loc[:, 0].isin(check)])
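The same idea can be written for frames with a named date column, filtering every frame rather than only the last one. A sketch under those assumptions; the column names date and v1..v3 are made up:

```python
import pandas as pd

# Hypothetical frames with a named "date" column
df1 = pd.DataFrame({"date": ["May 19, 2017", "May 17, 2017", "May 15, 2017"], "v1": [1200.0, 1000.0, 1901.0]})
df2 = pd.DataFrame({"date": ["May 20, 2017", "May 16, 2017", "May 15, 2017"], "v2": [2200.0, 2000.0, 2902.0]})
df3 = pd.DataFrame({"date": ["May 21, 2017", "May 17, 2017", "May 15, 2017"], "v3": [3200.0, 3100.0, 3903.0]})
dfs = [df1, df2, df3]

# Dates that occur in every frame
common = set.intersection(*(set(df["date"]) for df in dfs))

# Keep only the rows with a common date in each frame
filtered = [df[df["date"].isin(common)] for df in dfs]

print(common)  # {'May 15, 2017'}
for df in filtered:
    print(df)
```

This only filters the rows; the filtered frames still have to be merged or concatenated to get a single result.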
Upvotes: 0
Reputation: 862621
There are 2 solutions for this, but both return all the columns separately:
import functools
import numpy as np
import pandas as pd

dfs = [df1, df2, df3]
df_final = functools.reduce(lambda left, right: pd.merge(left, right, on='date'), dfs)
print(df_final)
          date     a_x   b_x       a_y      b_y   c_x         a        b   c_y
0  May 15,2017  900.00  0.2%  1,900.00  1000000  0.2%  2,900.00  2000000  0.2%

k = np.arange(len(dfs)).astype(str)
df = pd.concat([x.set_index('date') for x in dfs], axis=1, join='inner', keys=k)
df.columns = df.columns.map('_'.join)
print(df)
                0_a   0_b       1_a      1_b   1_c       2_a      2_b   2_c
date
May 15,2017  900.00  0.2%  1,900.00  1000000  0.2%  2,900.00  2000000  0.2%
Upvotes: 19