Reputation: 39
Using Jupyter Notebook on Python 3.6.3 |Anaconda custom (64-bit)| (default, Oct 15 2017, 03:27:45) [MSC v.1900 64 bit (AMD64)]
Consider the simple example below:
import pandas as pd
left = pd.DataFrame({'k': ['K0', 'K1', 'K2'], 'v': [1, 2, 3]}).set_index('k')
right = pd.DataFrame({'k': ['K0', 'K0', 'K3'], 'v': [4, 5, 6]}).set_index('k')
right2 = pd.DataFrame({'v': [7, 8, 9]}, index=['K1', 'K1', 'K3'])
left
right
right2
left.join(right,how='left',lsuffix='_L',rsuffix='_R')
pd.merge(left,right,how='left',right_index=True,left_index=True)
So far, so good! The last two lines produce equal results, as expected. The result of the following line, however, is unexpected to me: it includes indices that do not belong to the left dataframe (the result looks like an outer join):
left.join([right],how='left',lsuffix='_L',rsuffix='_R')
I noticed it also uses the default .merge suffixes rather than the ones I specified for .join, and I am not getting any error. Why is that?
Also when joining more than two dataframes like below:
left.join([right,right2])
I don't understand why the result includes indices that do not belong to the left dataframe even though this is a left join.
This can be seen in pandas documentation on join-merge
Thanks a lot!
Upvotes: 1
Views: 205
Reputation: 1106
If you inspect the source code of df.join() (see it on GitHub), you will see that at some point the following happens when other is not a DataFrame or Series, i.e. when it is a list:
# join indexes only using concat
if how == 'left':
    how = 'outer'
    join_axes = [self.index]
else:
    join_axes = None
frames = [self] + list(other)
can_concat = all(df.index.is_unique for df in frames)
if can_concat:
    return concat(frames, axis=1, join=how, join_axes=join_axes,
                  verify_integrity=True)
joined = frames[0]
for frame in frames[1:]:
    joined = merge(joined, frame, how=how, left_index=True,
                   right_index=True)
return joined
Thus how = 'left' is changed to how = 'outer'. I am not sure why this is done, but it appears to be some sort of preparation for concat (as the comment suggests); concat can only handle 'inner' or 'outer'. In your case, however, the indices are not unique, so can_concat is False and the for loop at the bottom of the code is executed instead (still using how='outer'). This explains what you are seeing: merge-like behavior, including the default merge suffixes, with an outer join.
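To see which branch applies to the frames from the question, the uniqueness test from the excerpt can be run on its own. This is only a small sketch of that one check, not a reproduction of the pandas internals; the variable name unique_flags is introduced here for illustration.

```python
import pandas as pd

left = pd.DataFrame({'k': ['K0', 'K1', 'K2'], 'v': [1, 2, 3]}).set_index('k')
right = pd.DataFrame({'k': ['K0', 'K0', 'K3'], 'v': [4, 5, 6]}).set_index('k')

frames = [left, right]

# Same condition as in the excerpt: the concat path is only taken
# when every frame has a unique index.
unique_flags = [df.index.is_unique for df in frames]
can_concat = all(unique_flags)

print(unique_flags, can_concat)
```

Because right contains 'K0' twice, its index is not unique, so the concat path is skipped and the merge loop is used.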
Of course you could use the same strategy but with how='left' directly in your code to do a series of left joins:
joined = left
for frame in [right, right2]:
    joined = pd.merge(joined, frame, how='left', left_index=True, right_index=True)
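The same idea can be wrapped in a small helper and checked against the frames from the question. This is a sketch under the assumption that a plain chain of index-based left merges is what you want; left_join_all is a hypothetical name, not a pandas function.

```python
from functools import reduce

import pandas as pd

left = pd.DataFrame({'k': ['K0', 'K1', 'K2'], 'v': [1, 2, 3]}).set_index('k')
right = pd.DataFrame({'k': ['K0', 'K0', 'K3'], 'v': [4, 5, 6]}).set_index('k')
right2 = pd.DataFrame({'v': [7, 8, 9]}, index=['K1', 'K1', 'K3'])


def left_join_all(first, others):
    """Left-join each frame in `others` onto `first` by index, one at a time."""
    return reduce(
        lambda acc, frame: pd.merge(acc, frame, how='left',
                                    left_index=True, right_index=True),
        others,
        first,
    )


joined = left_join_all(left, [right, right2])

# Index labels in the result that are not present in `left`.
extra = set(joined.index) - set(left.index)
print(joined)
print(extra)
```

With a true left join, extra should be empty: 'K3' exists only in right and right2, so it does not appear in the result.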
Upvotes: 1
Reputation: 4848
For the first part of your question (i.e. "I noticed it uses the .merge default suffix too, not the one I specified for .join, and I am not getting any error. Why is that?"), I don't know why, but it seems to be consistent with the documentation:
Notes
-----
on, lsuffix, and rsuffix options are not supported when passing a list
of DataFrame objects
As for the last part of your question, I don't really know. It just seems to behave this way when you pass a list.
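If the custom suffixes matter, one possible workaround is to skip the list form and call pd.merge directly, which accepts a suffixes argument. A minimal sketch using the frames from the question; the variable name result is just for illustration.

```python
import pandas as pd

left = pd.DataFrame({'k': ['K0', 'K1', 'K2'], 'v': [1, 2, 3]}).set_index('k')
right = pd.DataFrame({'k': ['K0', 'K0', 'K3'], 'v': [4, 5, 6]}).set_index('k')

# Index-based left join with explicit suffixes for the overlapping column 'v'.
result = pd.merge(left, right, how='left',
                  left_index=True, right_index=True,
                  suffixes=('_L', '_R'))

print(list(result.columns))
```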
Upvotes: 0