Moritz

Reputation: 5408

pandas merge dataframes generated inside a loop

Suppose I have dataframes like this (generated inside a loop and added to a list):

   column row  data_503  plate
0       1   A         1      2
1       1   B         2      2
2       1   C         3      2
3       1   D         4      2

   column row  data_280  plate
0       1   A         1      2
1       1   B         2      2
2       1   C         3      2
3       1   D         4      2

   column row  data_503  plate
0       1   A         1      1
1       1   B         2      1
2       1   C         3      1
3       1   D         4      1

   column row  data_280  plate
0       1   A         1      1
1       1   B         2      1
2       1   C         3      1
3       1   D         4      1

I also have a layout file that maps the measurements to specific conditions:

   column row  cond  plate
0       1   A     5      1
1       1   B     5      1
2       1   C     5      1
3       1   D     4      1
0       1   A     5      2
1       1   B     5      2
2       1   C     5      2
3       1   D     4      2

I can combine the dataframes like:

for df in df_list:
    layout = pd.merge(layout, df, on=['plate', 'row', 'column'], how='outer')

However, I always get data_280_x and data_280_y columns, but I would like to obtain only data_280 and data_503 columns. Changing how from 'outer' to 'left' does not change anything.

Any ideas how I could obtain something like this?

   column row  cond  plate  data_280  data_503
0       1   A     5      1         1         1
1       1   B     5      1         2         2
2       1   C     5      1         3         3
3       1   D     4      1         4         4
0       1   A     5      2         1         1
1       1   B     5      2         2         2
2       1   C     5      2         3         3
3       1   D     4      2         4         4

Upvotes: 2

Views: 8202

Answers (4)

Bharath M Shetty

Reputation: 30605

After merging, strip the _x/_y suffixes, then forward-fill along the columns so that each NaN is filled from the column to its left, and drop the duplicate columns, keeping the last one of each name, which is then completely filled:

# remove a trailing _x or _y (str.strip would remove characters, not a suffix)
layout.columns = layout.columns.str.replace(r'_[xy]$', '', regex=True)
layout.sort_index(axis=1).ffill(axis=1).loc[:, ~layout.sort_index(axis=1).columns.duplicated(keep='last')]

Output:

   column cond data_280 data_503 plate row
0      1    5        1        1     1   A
1      1    5        2        2     1   B
2      1    5        3        3     1   C
3      1    4        4        4     1   D
4      1    5        1        1     2   A
5      1    5        2        2     2   B
6      1    5        3        3     2   C
7      1    4        4        4     2   D

Upvotes: 1

danielfrg

Reputation: 2837

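You can combine the _x and _y columns, since they are not going to have any overlapping values (based on that layout df). Fill the NaNs with 0 first, otherwise the sum is NaN wherever one side is missing. Something like this:

df['data_280'] = df['data_280_x'].fillna(0) + df['data_280_y'].fillna(0)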

Then you can just drop the _x and _y columns.

Update with example:

import pandas as pd

df1 = pd.DataFrame({"column": [1, 1, 1, 1], "row": ["A", "B", "C", "D"], "plate": [1, 1, 1, 1], "data_503": [4, 5, 6, 7]})
df2 = pd.DataFrame({"column": [1, 1, 1, 1], "row": ["A", "B", "C", "D"], "plate": [1, 1, 1, 1], "data_280": [1, 2, 3, 4]})
df3 = pd.DataFrame({"column": [1, 1, 1, 1], "row": ["A", "B", "C", "D"], "plate": [2, 2, 2, 2], "data_503": [4, 5, 6, 7]})
df4 = pd.DataFrame({"column": [1, 1, 1, 1], "row": ["A", "B", "C", "D"], "plate": [2, 2, 2, 2], "data_280": [1, 2, 3, 4]})
layout = pd.DataFrame({"column": [1, 1, 1, 1, 1, 1, 1, 1], "row": ["A", "B", "C", "D", "A", "B", "C", "D"], "cond": [5, 5, 5, 4, 5, 5, 5, 4], "plate": [1, 1, 1, 1, 2, 2, 2, 2]})

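# Merge each measurement frame with the layout and keep only the matched rows
out = []
for df in [df1, df2, df3, df4]:
    matched = pd.merge(layout, df, on=['column', 'row', 'plate'], how='outer').dropna()
    out.append(matched)

# Outer-merge the partial results; repeated measurement columns get _x/_y suffixes
merged = out[0]
for df in out[1:]:
    merged = pd.merge(merged, df, on=['column', 'row', 'plate', 'cond'], how='outer')

# Each value is present in only one column of an _x/_y pair, so fill with 0 and add
merged = merged.fillna(0)

merged['data_280'] = merged['data_280_x'] + merged['data_280_y']
merged['data_503'] = merged['data_503_x'] + merged['data_503_y']

merged = merged.drop(['data_280_x', 'data_280_y', 'data_503_x', 'data_503_y'], axis=1)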

Gave me:

   column  cond  plate row  data_280  data_503
0       1     5      1   A       1.0       4.0
1       1     5      1   B       2.0       5.0
2       1     5      1   C       3.0       6.0
3       1     4      1   D       4.0       7.0
4       1     5      2   A       1.0       4.0
5       1     5      2   B       2.0       5.0
6       1     5      2   C       3.0       6.0
7       1     4      2   D       4.0       7.0
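
As a possible alternative to filling with 0 and adding, an _x/_y pair can be coalesced with Series.combine_first, which leaves a genuinely missing measurement as NaN instead of turning it into 0. This is only a minimal sketch, assuming a frame shaped like the merged result above; the helper name coalesce_suffixed and the sample values are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical frame mimicking the result of repeated outer merges:
# the measurement is split across an _x and a _y column.
merged = pd.DataFrame({
    "plate": [1, 1, 2, 2],
    "row": ["A", "B", "A", "B"],
    "data_280_x": [1.0, 2.0, np.nan, np.nan],
    "data_280_y": [np.nan, np.nan, 1.0, 2.0],
})


def coalesce_suffixed(frame, name):
    """Fold name_x and name_y into a single column called name (illustrative helper)."""
    out = frame.copy()
    # combine_first keeps the values of _x and fills its gaps from _y
    out[name] = out[name + "_x"].combine_first(out[name + "_y"])
    return out.drop(columns=[name + "_x", name + "_y"])


merged = coalesce_suffixed(merged, "data_280")
print(merged)
```

The same call would be repeated for data_503. If both columns of a pair held a value for the same row, the _x value would win, so this relies on the same non-overlap assumption as the sum.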

Upvotes: 3

Simon

Reputation: 333

I'm not sure this is the most sophisticated solution, but you could start by concatenating all the data_503 dataframes together and all the data_280 dataframes together, then merge each of the two results into the layout.

The code is not pretty, I have to run to work though :)

df_list = [df1, df2, df3, df4]

# Split the frames by the measurement column they carry
data_280_list = [k for k in df_list if 'data_280' in k.columns]
data_503_list = [k for k in df_list if 'data_503' in k.columns]

# Stack the plates of each measurement into one frame
df_503 = pd.concat(data_503_list)
df_280 = pd.concat(data_280_list)

# Each measurement is merged exactly once, so no _x/_y suffixes appear
for df in [df_503, df_280]:
    layout = pd.merge(layout, df, on=['plate', 'row', 'column'], how='outer')
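
For reference, here is a self-contained sketch of the same idea that does not hard-code the measurement names. The sample frames, the make_frame helper, and the assumption that every frame carries exactly one non-key column are illustrative and not taken from the question:

```python
import pandas as pd

keys = ["plate", "row", "column"]


def make_frame(plate, value_col):
    """Hypothetical helper that builds a small measurement frame for one plate."""
    return pd.DataFrame({
        "column": [1, 1],
        "row": ["A", "B"],
        "plate": [plate, plate],
        value_col: [1, 2],
    })


df_list = [
    make_frame(2, "data_503"),
    make_frame(2, "data_280"),
    make_frame(1, "data_503"),
    make_frame(1, "data_280"),
]
layout = pd.DataFrame({
    "column": [1, 1, 1, 1],
    "row": ["A", "B", "A", "B"],
    "cond": [5, 5, 5, 5],
    "plate": [1, 1, 2, 2],
})

# Group the frames by their single measurement column (assumed: one per frame)
by_measure = {}
for df in df_list:
    (value_col,) = [c for c in df.columns if c not in keys]
    by_measure.setdefault(value_col, []).append(df)

# Stack the plates of each measurement, then merge it into the layout once
result = layout
for value_col, frames in by_measure.items():
    stacked = pd.concat(frames, ignore_index=True)
    result = result.merge(stacked, on=keys, how="left")

print(result)
```

Because each measurement name reaches merge only once, pandas has no column-name clash to resolve and adds no suffixes.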

Upvotes: 2

liam

Reputation: 2014

Use pd.concat to combine the list of DataFrames into a single big DataFrame.
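
A minimal sketch of how that could look. Note that pd.concat on its own only stacks the rows, leaving NaN wherever a frame lacks a column, so this sketch assumes a follow-up groupby(...).first() to collapse the rows per well and a final merge into the layout. Those two steps and the sample data are additions for illustration, not part of the suggestion above:

```python
import pandas as pd

keys = ["plate", "row", "column"]

# Illustrative sample data, one measurement column per frame
df_list = [
    pd.DataFrame({"column": [1, 1], "row": ["A", "B"], "plate": [1, 1], "data_503": [4, 5]}),
    pd.DataFrame({"column": [1, 1], "row": ["A", "B"], "plate": [1, 1], "data_280": [1, 2]}),
    pd.DataFrame({"column": [1, 1], "row": ["A", "B"], "plate": [2, 2], "data_503": [6, 7]}),
    pd.DataFrame({"column": [1, 1], "row": ["A", "B"], "plate": [2, 2], "data_280": [3, 4]}),
]
layout = pd.DataFrame({
    "column": [1, 1, 1, 1],
    "row": ["A", "B", "A", "B"],
    "cond": [5, 5, 5, 4],
    "plate": [1, 1, 2, 2],
})

# Stack everything; each row has a value in only one measurement column
stacked = pd.concat(df_list, ignore_index=True, sort=False)

# first() takes the first non-null value per column within each well
combined = stacked.groupby(keys, as_index=False).first()

# A single merge against the layout, so no _x/_y suffixes are produced
result = layout.merge(combined, on=keys, how="left")
print(result)
```

The measurement columns come out as floats here, because the intermediate stacked frame contains NaN.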

Upvotes: 0
