Reputation: 5408
Suppose I have dataframes like this (generated inside a loop and added to a list):
column row data_503 plate
0 1 A 1 2
1 1 B 2 2
2 1 C 3 2
3 1 D 4 2
column row data_280 plate
0 1 A 1 2
1 1 B 2 2
2 1 C 3 2
3 1 D 4 2
column row data_503 plate
0 1 A 1 1
1 1 B 2 1
2 1 C 3 1
3 1 D 4 1
column row data_280 plate
0 1 A 1 1
1 1 B 2 1
2 1 C 3 1
3 1 D 4 1
I also have a layout file which maps the measurements to specific conditions:
column row cond plate
0 1 A 5 1
1 1 B 5 1
2 1 C 5 1
3 1 D 4 1
0 1 A 5 2
1 1 B 5 2
2 1 C 5 2
3 1 D 4 2
I can combine the dataframes like:
for df in df_list:
    layout = pd.merge(layout, df, on=['plate', 'row', 'column'], how='outer')
However, I always get data_280_x and data_280_y columns, but I would like to obtain only data_280 and data_503 columns. Changing outer to left does not change anything.
Any ideas how I could obtain something like this?
column row cond plate data_280 data_503
0 1 A 5 1 1 1
1 1 B 5 1 2 2
2 1 C 5 1 3 3
3 1 D 4 1 4 4
0 1 A 5 2 1 1
1 1 B 5 2 2 2
2 1 C 5 2 3 3
3 1 D 4 2 4 4
Upvotes: 2
Views: 8202
Reputation: 30605
After merging, strip the suffixes, then apply ffill along the columns so that NaN values are filled from the preceding column, and drop the duplicate columns, keeping the last one of each, which is then completely filled, i.e.
layout.columns = layout.columns.str.replace(r'_[xy]$', '', regex=True)
layout.sort_index(axis=1).ffill(axis=1).loc[:, ~layout.sort_index(axis=1).columns.duplicated(keep='last')]
Output:
column cond data_280 data_503 plate row
0 1 5 1 1 1 A
1 1 5 2 2 1 B
2 1 5 3 3 1 C
3 1 4 4 4 1 D
4 1 5 1 1 2 A
5 1 5 2 2 2 B
6 1 5 3 3 2 C
7 1 4 4 4 2 D
Upvotes: 1
Reputation: 2837
You can combine the _x and _y columns, since they are not going to have any overlapping values (based on that layout df). After filling the NaN values with 0, something like this:
df['data_280'] = df['data_280_x'] + df['data_280_y']
Then you can just drop the _x and _y columns.
Update with example:
import pandas as pd

df1 = pd.DataFrame({"column": [1, 1, 1, 1], "row": ["A", "B", "C", "D"], "plate": [1, 1, 1, 1], "data_503": [4, 5, 6, 7]})
df2 = pd.DataFrame({"column": [1, 1, 1, 1], "row": ["A", "B", "C", "D"], "plate": [1, 1, 1, 1], "data_280": [1, 2, 3, 4]})
df3 = pd.DataFrame({"column": [1, 1, 1, 1], "row": ["A", "B", "C", "D"], "plate": [2, 2, 2, 2], "data_503": [4, 5, 6, 7]})
df4 = pd.DataFrame({"column": [1, 1, 1, 1], "row": ["A", "B", "C", "D"], "plate": [2, 2, 2, 2], "data_280": [1, 2, 3, 4]})
layout = pd.DataFrame({"column": [1, 1, 1, 1, 1, 1, 1, 1], "row": ["A", "B", "C", "D", "A", "B", "C", "D"], "cond": [5, 5, 5, 4, 5, 5, 5, 4], "plate": [1, 1, 1, 1, 2, 2, 2, 2]})

# Attach each measurement frame to the layout; dropna() keeps only the rows of the plate it covers
out = []
for df in [df1, df2, df3, df4]:
    out.append(pd.merge(layout, df, on=['column', 'row', 'plate'], how='outer').dropna())

# Merge the per-frame results; repeated data columns get _x/_y suffixes
merged = out[0]
for df in out[1:]:
    merged = pd.merge(merged, df, on=['column', 'row', 'plate', 'cond'], how='outer')

# Each row has a value in only one of the _x/_y columns, so fill NaN with 0 and add them
merged = merged.fillna(0)
merged['data_280'] = merged['data_280_x'] + merged['data_280_y']
merged['data_503'] = merged['data_503_x'] + merged['data_503_y']
merged = merged.drop(['data_280_x', 'data_280_y', 'data_503_x', 'data_503_y'], axis=1)
Gave me:
column cond plate row data_280 data_503
0 1 5 1 A 1.0 4.0
1 1 5 1 B 2.0 5.0
2 1 5 1 C 3.0 6.0
3 1 4 1 D 4.0 7.0
4 1 5 2 A 1.0 4.0
5 1 5 2 B 2.0 5.0
6 1 5 2 C 3.0 6.0
7 1 4 2 D 4.0 7.0
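As a possible variation on the fill-with-0-and-add step, combine_first takes the value from the _x column and falls back to the _y column where _x is NaN, so a real measurement of 0 is not confused with a missing one. This is only a sketch on a small made-up frame (the values below are assumptions for illustration, not the data from the question):

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the merged frame: each row has a value in only one column
merged = pd.DataFrame({
    "data_280_x": [1.0, 2.0, np.nan, np.nan],
    "data_280_y": [np.nan, np.nan, 3.0, 4.0],
})

# Take _x where present, otherwise fall back to _y
merged["data_280"] = merged["data_280_x"].combine_first(merged["data_280_y"])

# Drop the suffixed helper columns
merged = merged.drop(columns=["data_280_x", "data_280_y"])
```

Rows where both columns are NaN stay NaN instead of turning into 0.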
Upvotes: 3
Reputation: 333
I'm not sure this is the most sophisticated solution, but you could start by concatenating all data_503 and data_280 dataframes together, then merge them.
The code is not pretty, I have to run to work though :)
df_list = [df1, df2, df3, df4]
data_280_list = []
for k in df_list:
    if 'data_280' in k.columns:
        data_280_list.append(k)
data_503_list = []
for k in df_list:
    if 'data_503' in k.columns:
        data_503_list.append(k)
df_503 = pd.concat(data_503_list)
df_280 = pd.concat(data_280_list)
for df in [df_503, df_280]:
    layout = pd.merge(layout, df, on=['plate', 'row', 'column'], how='outer')
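The same idea can probably be written without hard-coding the data column names, by bucketing the frames on whatever their single non-key column is called. A rough sketch, assuming every frame has exactly one data column besides plate, row and column; make_frame and the sample values are hypothetical, only there to make the example runnable:

```python
from collections import defaultdict

import pandas as pd

keys = ["plate", "row", "column"]
rows = ["A", "B", "C", "D"]

def make_frame(plate, value_col):
    # Hypothetical helper that builds one measurement frame like those in the question
    return pd.DataFrame({"column": 1, "row": rows, value_col: [1, 2, 3, 4], "plate": plate})

df_list = [make_frame(2, "data_503"), make_frame(2, "data_280"),
           make_frame(1, "data_503"), make_frame(1, "data_280")]
layout = pd.DataFrame({"column": 1, "row": rows * 2,
                       "cond": [5, 5, 5, 4] * 2, "plate": [1] * 4 + [2] * 4})

# Group the frames by the name of their single data column
buckets = defaultdict(list)
for df in df_list:
    (value_col,) = [c for c in df.columns if c not in keys]
    buckets[value_col].append(df)

# Stack each group first, so every data column is merged into the layout only once
result = layout
for value_col, frames in buckets.items():
    stacked = pd.concat(frames, ignore_index=True)
    result = result.merge(stacked, on=keys, how="left")
```

Since each data column name reaches merge only once, there is no name clash and no _x/_y suffixes are created.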
Upvotes: 2
Reputation: 2014
Use pd.concat to merge a list of DataFrames into a single big DataFrame.
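To spell that out a bit: concatenating alone leaves one row per well per source frame, with NaN in the data columns that frame did not have, so the rows still need collapsing before joining the layout. One way that might work is groupby(...).first(), which keeps the first non-null value per column within each group. A sketch, with make_frame and the sample values being assumptions for illustration:

```python
import pandas as pd

keys = ["plate", "row", "column"]
rows = ["A", "B", "C", "D"]

def make_frame(plate, value_col):
    # Hypothetical helper that builds one measurement frame like those in the question
    return pd.DataFrame({"column": 1, "row": rows, value_col: [1, 2, 3, 4], "plate": plate})

df_list = [make_frame(2, "data_503"), make_frame(2, "data_280"),
           make_frame(1, "data_503"), make_frame(1, "data_280")]
layout = pd.DataFrame({"column": 1, "row": rows * 2,
                       "cond": [5, 5, 5, 4] * 2, "plate": [1] * 4 + [2] * 4})

# Stack everything; columns a frame lacks are filled with NaN
stacked = pd.concat(df_list, ignore_index=True)

# Collapse to one row per well, taking the first non-null value in each data column
measurements = stacked.groupby(keys, as_index=False).first()

# A single merge against the layout, so no _x/_y suffixes appear
result = layout.merge(measurements, on=keys, how="left")
```

The data columns come out as floats here, because concat introduces NaN before the rows are collapsed.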
Upvotes: 0