Reputation: 724
I have two large datasets I want to merge which have a common column, "gene".
All entries are unique in df1
in [85]: df1
Out[85]:
gene
0 Cdk12
1 Cdk2ap1
2 Cdk7
3 Cdk8
4 Cdx2
5 Cenpa
6 Cenpa
7 Cenpa
8 Cenpc1
9 Cenpe
10 Cenpj
df2
Out[86]:
gene year DOI
0 Cdk12 2001 10.1038/35055500
1 Cdk12 2002 10.1038/nature01266
2 Cdk12 2002 10.1074/jbc.M106813200
3 Cdk12 2003 10.1073/pnas.1633296100
4 Cdk12 2003 10.1073/pnas.2336103100
5 Cdk12 2005 10.1093/nar/gni045
6 Cdk12 2005 10.1126/science.1112014
7 Cdk12 2008 10.1101/gr.078352.108
8 Cdk12 2011 10.1371/journal.pbio.1000582
9 Cdk12 2012 10.1074/jbc.M111.321760
10 Cdk12 2016 10.1038/cdd.2015.157
11 Cdk12 2017 10.1093/cercor/bhw081
12 Cdk2ap1 2001 10.1006/geno.2001.6474
13 Cdk2ap1 2001 10.1038/35055500
14 Cdk2ap1 2002 10.1038/nature01266
I want to keep the order of df1 because I am going to join it alongside a different dataset.
df2 has many entries for each gene, and I want only one per gene: the entry with the most recent year should be the one kept.
I have tried reading the files into pandas and then naming the columns:
df1 = pd.read_csv('T1inorderforMerge.csv', header = None)
df2 = pd.read_csv('T2inorderforMerge.csv', header = None)
df1.columns = ["gene"]
df2.columns = ["gene","year","DOI"]
I have tried all variations of the code below, i.e. changing the how argument and the order of the DataFrames.
df3 = pd.merge(df1, df2, on ="gene", how="left")
I have tried vertical and horizontal stacking, which, as will be obvious to some, didn't work. There is a lot of other messy code I have also tried, but I really want to see how/if I can do this using pandas.
Upvotes: 3
Views: 1763
Reputation: 11
I'm not sure what type(df1) is, but:
In [1]: df1 = ['a', 'f', 'g']
In [2]: df2 = [['a', 7, True], ['g',8, False]]
In [3]: [[inner_item for inner_item in df2 if inner_item[0] == outer_item][0] if len([inner_item for inner_item in df2 if inner_item[0] == outer_item])>0 else [outer_item,None,None] for outer_item in df1]
Out[3]: [['a', 7, True], ['f', None, None], ['g', 8, False]]
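For readability, essentially the same left-join lookup can be written with a dict keyed on the first element of each df2 row (a sketch assuming the same list-of-lists inputs as above):
lookup = {row[0]: row for row in df2}                      # gene -> full row
result = [lookup.get(gene, [gene, None, None]) for gene in df1]
# [['a', 7, True], ['f', None, None], ['g', 8, False]]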
Upvotes: 1
Reputation: 863611
I think one possible solution is create helper columns which count values of gene
and then merge pairs - first Cdk12
in df1
with first Cdk12
in df2
, second Cdk12
with second Cdk12
,... . Unique values are merged 1 to 1, classic way (because a
is then always 0
):
df1['a'] = df1.groupby('gene').cumcount()
df2['a'] = df2.groupby('gene').cumcount()
print (df1)
gene a
0 Cdk12 0
1 Cdk2ap1 0
2 Cdk7 0
3 Cdk8 0
4 Cdx2 0
5 Cenpa 0
6 Cenpa 1
7 Cenpa 2
8 Cenpc1 0
9 Cenpe 0
10 Cenpj 0
print (df2)
gene year DOI a
0 Cdk12 2001 10.1038/35055500 0
1 Cdk12 2002 10.1038/nature01266 1
2 Cdk12 2002 10.1074/jbc.M106813200 2
3 Cdk12 2003 10.1073/pnas.1633296100 3
4 Cdk12 2003 10.1073/pnas.2336103100 4
5 Cdk12 2005 10.1093/nar/gni045 5
6 Cdk12 2005 10.1126/science.1112014 6
7 Cdk12 2008 10.1101/gr.078352.108 7
8 Cdk12 2011 10.1371/journal.pbio.1000582 8
9 Cdk12 2012 10.1074/jbc.M111.321760 9
10 Cdk12 2016 10.1038/cdd.2015.157 10
11 Cdk12 2017 10.1093/cercor/bhw081 11
12 Cdk2ap1 2001 10.1006/geno.2001.6474 0
13 Cdk2ap1 2001 10.1038/35055500 1
14 Cdk2ap1 2002 10.1038/nature01266 2
df3 = pd.merge(df1, df2, on =["a","gene"], how="left").drop('a', axis=1)
print (df3)
gene year DOI
0 Cdk12 2001.0 10.1038/35055500
1 Cdk2ap1 2001.0 10.1006/geno.2001.6474
2 Cdk7 NaN NaN
3 Cdk8 NaN NaN
4 Cdx2 NaN NaN
5 Cenpa NaN NaN
6 Cenpa NaN NaN
7 Cenpa NaN NaN
8 Cenpc1 NaN NaN
9 Cenpe NaN NaN
10 Cenpj NaN NaN
You also get NaN for all rows whose gene has no matching pair.
But if you need to process only unique values in df1['gene'], then you need drop_duplicates first in both DataFrames:
df1 = df1.drop_duplicates('gene')
df2 = df2.drop_duplicates('gene')
print (df1)
gene
0 Cdk12
1 Cdk2ap1
2 Cdk7
3 Cdk8
4 Cdx2
5 Cenpa
8 Cenpc1
9 Cenpe
10 Cenpj
print (df2)
gene year DOI
0 Cdk12 2001 10.1038/35055500
12 Cdk2ap1 2001 10.1006/geno.2001.6474
df3 = pd.merge(df1, df2, on ="gene", how="left")
print (df3)
gene year DOI
0 Cdk12 2001.0 10.1038/35055500
1 Cdk2ap1 2001.0 10.1006/geno.2001.6474
2 Cdk7 NaN NaN
3 Cdk8 NaN NaN
4 Cdx2 NaN NaN
5 Cenpa NaN NaN
6 Cenpc1 NaN NaN
7 Cenpe NaN NaN
8 Cenpj NaN NaN
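Note that the question asks to keep the most recent year per gene. A minimal sketch of that variant, assuming df2 is the original table with all rows, is to sort by year descending so drop_duplicates keeps the newest entry, then merge:
# keep the newest row per gene, then left-join onto the deduplicated df1
df2_latest = df2.sort_values('year', ascending=False).drop_duplicates('gene')
df3 = pd.merge(df1.drop_duplicates('gene'), df2_latest, on='gene', how='left')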
Upvotes: 3