Stefano Potter
Stefano Potter

Reputation: 3577

Merging DataFrames of different length

My first DataFrame looks like this:

ESD    Species      Key1
Sub       POAP     SubPOAP
Sub       FRAP     SubFRAP
Sand      POAP     SandPOAP
Sand      ACER     SandACER
Sand      SALI     SandSALI

So the Key1 is the concatenation of ESD and Species.

My second DataFrame looks like this:

ESD  Species      Values     Key
Sub      POAP       10     SubPOAP 
Sub      FRAP       20     SubFRAP

So I want to join the second frame to the first based on Key1 and Key but the first DataFrame is larger and I still want to retain the columns that don't match (with ESD as Sand) in the new DataFrame. Eventually I will merge anything with an ESD of Sand too.

So far I have used this code and set 'Key1' and 'Key' as indexes:

merge=pd.merge(df, df2, left_index='True', right_index='True') 

but this does not retain columns that don't have a matching key. Would perhaps a join instead of a merge be what I need?

This code results in:

 ESD    Species      Key1      Values
 Sub       POAP     SubPOAP     10 
 Sub       FRAP     SubFRAP     20

The desired output is:

ESD    Species      Key1      Values
Sub       POAP     SubPOAP     10 
Sub       FRAP     SubFRAP     20
Sand      POAP     SandPOAP    NaN
Sand      ACER     SandACER    NaN
Sand      SALI     SandSALI    NaN

I don't really care if the duplicate column heading carry over to the join, I will just remove them (there are two columns named Species).

Upvotes: 0

Views: 2964

Answers (1)

unutbu
unutbu

Reputation: 879143

Instead of concatenating ESD and Species into a single string, it is better to merge on a list of columns:

result = pd.merge(df, df2, on=['ESD', 'Species'], how='left')

To keep one row in result for each row in df (i.e. perform a left-join), use how='left'.


import pandas as pd
try:
    # for Python2
    from cStringIO import StringIO 
except ImportError:
    # for Python3
    from io import StringIO

df = pd.read_table(StringIO('''\
ESD    Species      Key1
Sub       POAP     SubPOAP
Sub       FRAP     SubFRAP
Sand      POAP     SandPOAP
Sand      ACER     SandACER
Sand      SALI     SandSALI'''), sep='\s+')


df2 = pd.read_table(StringIO('''\
ESD  Species      Values     Key
Sub      POAP       10     SubPOAP 
Sub      FRAP       20     SubFRAP'''), sep='\s+')

result = pd.merge(df, df2, on=['ESD', 'Species'], how='left')
print(result)

yields

    ESD Species      Key1  Values      Key
0   Sub    POAP   SubPOAP      10  SubPOAP
1   Sub    FRAP   SubFRAP      20  SubFRAP
2  Sand    POAP  SandPOAP     NaN      NaN
3  Sand    ACER  SandACER     NaN      NaN
4  Sand    SALI  SandSALI     NaN      NaN

Note that if df looked like this,

ESD    Species      Key1
FOO       BAR      FOOBAR
FOOB      AR       FOOBAR

then two different (ESD, Species) pairs could be mapped to the same Key1 -- i.e. FOOBAR. Thus concatenating strings to produce a key for the purpose of merging may lead to errors.

This may or may not affect you, but since it can in theory lead to errors, and since you can merge on ['ESD', 'Species'] directly, it is better to avoid the concatenation.

Upvotes: 2

Related Questions