Reputation: 724
I have two large datasets I want to merge which have a common column, "gene".
All entries are unique in df1
in [85]: df1
Out[85]:
gene
0 Cdk12
1 Cdk2ap1
2 Cdk7
3 Cdk8
4 Cdx2
5 Cenpa
6 Cenpa
7 Cenpa
8 Cenpc1
9 Cenpe
10 Cenpj
df2
Out[86]:
gene year DOI
0 Cdk12 2001 10.1038/35055500
1 Cdk12 2002 10.1038/nature01266
2 Cdk12 2002 10.1074/jbc.M106813200
3 Cdk12 2003 10.1073/pnas.1633296100
4 Cdk12 2003 10.1073/pnas.2336103100
5 Cdk12 2005 10.1093/nar/gni045
6 Cdk12 2005 10.1126/science.1112014
7 Cdk12 2008 10.1101/gr.078352.108
8 Cdk12 2011 10.1371/journal.pbio.1000582
9 Cdk12 2012 10.1074/jbc.M111.321760
10 Cdk12 2016 10.1038/cdd.2015.157
11 Cdk12 2017 10.1093/cercor/bhw081
12 Cdk2ap1 2001 10.1006/geno.2001.6474
13 Cdk2ap1 2001 10.1038/35055500
14 Cdk2ap1 2002 10.1038/nature01266
I want to keep the order of df1 because I am going to join it alongside a different dataset.
df2 has many entries for each gene, and I want only one per gene: the entry with the most recent year should be the one kept.
I have tried reading the files into pandas and then naming the columns:
df1 = pd.read_csv('T1inorderforMerge.csv', header = None)
df2 = pd.read_csv('T2inorderforMerge.csv', header = None)
df1.columns = ["gene"]
df2.columns = ["gene","year","DOI"]
I have tried all variations of the code below, i.e. changing the how argument and the order of the DataFrames.
df3 = pd.merge(df1, df2, on ="gene", how="left")
I have tried vertical and horizontal stacking, which, as will be obvious to some, didn't work. There is a lot of other messy code I have also tried, but I really want to see how/if I can do this using pandas.
Upvotes: 3
Views: 1763
Reputation: 11
I'm not sure what type(df1) is, but:
In [1]: df1 = ['a', 'f', 'g']
In [2]: df2 = [['a', 7, True], ['g',8, False]]
In [3]: [[inner_item for inner_item in df2 if inner_item[0] == outer_item][0] if len([inner_item for inner_item in df2 if inner_item[0] == outer_item])>0 else [outer_item,None,None] for outer_item in df1]
Out[3]: [['a', 7, True], ['f', None, None], ['g', 8, False]]
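For readability, essentially the same left-join lookup can be written with a dict keyed on the first element of each df2 row (a sketch assuming the same list-of-lists inputs as above):
lookup = {row[0]: row for row in df2}                      # gene -> full row
result = [lookup.get(gene, [gene, None, None]) for gene in df1]
# [['a', 7, True], ['f', None, None], ['g', 8, False]]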
Upvotes: 1
Reputation: 863611
I think one possible solution is create helper columns which count values of gene
and then merge pairs - first Cdk12
in df1
with first Cdk12
in df2
, second Cdk12
with second Cdk12
,... . Unique values are merged 1 to 1, classic way (because a
is then always 0
):
df1['a'] = df1.groupby('gene').cumcount()
df2['a'] = df2.groupby('gene').cumcount()
print (df1)
gene a
0 Cdk12 0
1 Cdk2ap1 0
2 Cdk7 0
3 Cdk8 0
4 Cdx2 0
5 Cenpa 0
6 Cenpa 1
7 Cenpa 2
8 Cenpc1 0
9 Cenpe 0
10 Cenpj 0
print (df2)
gene year DOI a
0 Cdk12 2001 10.1038/35055500 0
1 Cdk12 2002 10.1038/nature01266 1
2 Cdk12 2002 10.1074/jbc.M106813200 2
3 Cdk12 2003 10.1073/pnas.1633296100 3
4 Cdk12 2003 10.1073/pnas.2336103100 4
5 Cdk12 2005 10.1093/nar/gni045 5
6 Cdk12 2005 10.1126/science.1112014 6
7 Cdk12 2008 10.1101/gr.078352.108 7
8 Cdk12 2011 10.1371/journal.pbio.1000582 8
9 Cdk12 2012 10.1074/jbc.M111.321760 9
10 Cdk12 2016 10.1038/cdd.2015.157 10
11 Cdk12 2017 10.1093/cercor/bhw081 11
12 Cdk2ap1 2001 10.1006/geno.2001.6474 0
13 Cdk2ap1 2001 10.1038/35055500 1
14 Cdk2ap1 2002 10.1038/nature01266 2
df3 = pd.merge(df1, df2, on =["a","gene"], how="left").drop('a', axis=1)
print (df3)
gene year DOI
0 Cdk12 2001.0 10.1038/35055500
1 Cdk2ap1 2001.0 10.1006/geno.2001.6474
2 Cdk7 NaN NaN
3 Cdk8 NaN NaN
4 Cdx2 NaN NaN
5 Cenpa NaN NaN
6 Cenpa NaN NaN
7 Cenpa NaN NaN
8 Cenpc1 NaN NaN
9 Cenpe NaN NaN
10 Cenpj NaN NaN
You also get NaN for all rows whose gene has no matching pair.
But if you need to process only unique values in df1['gene'], then you need drop_duplicates first in both DataFrames:
df1 = df1.drop_duplicates('gene')
df2 = df2.drop_duplicates('gene')
print (df1)
gene
0 Cdk12
1 Cdk2ap1
2 Cdk7
3 Cdk8
4 Cdx2
5 Cenpa
8 Cenpc1
9 Cenpe
10 Cenpj
print (df2)
gene year DOI
0 Cdk12 2001 10.1038/35055500
12 Cdk2ap1 2001 10.1006/geno.2001.6474
df3 = pd.merge(df1, df2, on ="gene", how="left")
print (df3)
gene year DOI
0 Cdk12 2001.0 10.1038/35055500
1 Cdk2ap1 2001.0 10.1006/geno.2001.6474
2 Cdk7 NaN NaN
3 Cdk8 NaN NaN
4 Cdx2 NaN NaN
5 Cenpa NaN NaN
6 Cenpc1 NaN NaN
7 Cenpe NaN NaN
8 Cenpj NaN NaN
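Note that the question asks to keep the most recent year per gene. A minimal sketch of that variant, assuming df2 is the original table with all rows, is to sort by year descending so drop_duplicates keeps the newest entry, then merge:
# keep the newest row per gene, then left-join onto the deduplicated df1
df2_latest = df2.sort_values('year', ascending=False).drop_duplicates('gene')
df3 = pd.merge(df1.drop_duplicates('gene'), df2_latest, on='gene', how='left')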
Upvotes: 3