pdubois
pdubois

Reputation: 7800

Merging list of DFs with alternating columns output using Pandas

I have the following codes:

import pandas as pd

rep1 = pd.DataFrame.from_items([('Probe', ['x', 'y', 'z']), ('Gene', ['foo', 'bar', 'qux']), ('RP1',[1.00,23.22,11.12]),('RP1',["A","B","C"])   ], orient='columns')
rep2 = pd.DataFrame.from_items([('Probe', ['x', 'y', 'z']), ('Gene', ['foo', 'bar', 'qux']), ('RP2',[3.33,77.22,18.12]),('RP2',["G","I","K"])   ], orient='columns')
rep3 = pd.DataFrame.from_items([('Probe', ['x', 'y', 'k']), ('Gene', ['foo', 'bar', 'kux']), ('RP3',[99.99,98.29,8.10]),('RP2',["M","P","J"]) ], orient='columns')

tmp = []
tmp.append(rep1)
tmp.append(rep2)
tmp.append(rep3)

Which produces the following list of data frames.

In [56]: tmp
Out[56]:
 [  Probe Gene    RP1 RP1
 0     x  foo   1.00   A
 1     y  bar  23.22   B
 2     z  qux  11.12   C,   Probe Gene    RP2 RP2
 0     x  foo   3.33   G
 1     y  bar  77.22   I
 2     z  qux  18.12   K,   Probe Gene    RP3 RP2
 0     x  foo  99.99   M
 1     y  bar  98.29   P
 2     k  kux   8.10   J]

Each of the data frames above has the following characteristics:

  1. Always contain 4 columns
  2. Second and last column has identical names
  3. First two columns always named Probe and Gene
  4. Content of Probe and Gene always appear consistently, ie. 'x' always goes with 'foo'.

I'm trying to merge those DFs in the list so that it produces this:

  Probe Gene    RP1     RP2  RP3    RP1  RP2  RP3
0     x  foo   1.00    3.33  99.99    A    G   M
1     y  bar  23.22   77.22  98.29    B    I   P
2     z  qux  11.12   18.12   NA      C    K   NA
3     k  kux     NA      NA  8.10     NA   NA  J

I tried this code but failed:

In [67]: reduce(pd.merge,tmp)
MergeError: Left data columns not unique: Index([u'Probe', u'Gene', u'RP1', u'RP1'], dtype='object')

What's the right way to do it?

Upvotes: 1

Views: 253

Answers (1)

Andy Hayden
Andy Hayden

Reputation: 375675

You could dedupe the column names. Here's a kind of hacky way:

In [11]: list(rep1.columns[0:2]) + [rep1.columns[2] + "_value"] + [rep1.columns[2] + "_letter"]
Out[11]: ['Probe', 'Gene', 'RP1_value', 'RP1_letter']

In [12]: for rep in tmp:
   .....:     rep.columns = list(rep.columns[0:2]) + [rep.columns[2] + "_value"] + [rep.columns[2] + "_letter"]

In [13]: reduce(pd.merge,tmp)
Out[13]:
  Probe Gene  RP1_value RP1_letter  RP2_value RP2_letter  RP3_value RP3_letter
0     x  foo       1.00          A       3.33          G      99.99          M
1     y  bar      23.22          B      77.22          I      98.29          P

You also need to specify it as an outer merge (to get the NaN rows):

In [21]: reduce(lambda x, y: pd.merge(x, y, how='outer'),tmp)
Out[21]:
  Probe Gene  RP1_value RP1_letter  RP2_value RP2_letter  RP3_value RP3_letter
0     x  foo       1.00          A       3.33          G      99.99          M
1     y  bar      23.22          B      77.22          I      98.29          P
2     z  qux      11.12          C      18.12          K        NaN        NaN
3     k  kux        NaN        NaN        NaN        NaN       8.10          J

Upvotes: 1

Related Questions