Reputation: 83
I have a dataframe with multiple pairs of columns that have to be merged. The columns contain mutually exclusive data. That is, if there is a value in Column A, the value for that row in Column B will be empty.
df = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
'A': ['A0', '', 'A2', ''],
'B': ['', 'B1', '', 'B3'],
'C': ['C1','C2','',''],
'D': ['','','D3','D4']})
So I have something like this:
A B C D key
0 A0 C1 K0
1 B1 C2 K1
2 A2 D3 K2
3 B3 D4 K3
I'd like to merge columns A and B so all values end up in column A. I also want to do this form C and D, while keeping the index and any other columns such as Key untouched. I'm fine doing this in multiple steps. I don't need to do A-B merge and the C-D merge at the same time. Ideally, I would end up with:
A C key
0 A0 C1 K0
1 B1 C2 K1
2 A2 D3 K2
3 B3 D4 K3
I've tried df = df.A.combine_first(df.B)
but that gets me nowhere.
Upvotes: 0
Views: 8394
Reputation: 16987
If you're fine with converting the empty strings to NaN first you could try using Series.fillna by doing something like this:
import numpy as np
df = df.replace('', np.nan)
df.A = df.A.fillna(df.B)
df.C = df.C.fillna(df.D)
which gives:
df
A B C D key
0 A0 NaN C1 NaN K0
1 B1 B1 C2 NaN K1
2 A2 NaN D3 D3 K2
3 B3 B3 D4 D4 K3
Which will fill the missing values in column A with those in column B then set the resulting series to column A in df
. We can extend this idea for the case of multiple columns by using a loop:
l_cols = df.columns[0::2] # Left columns in the pairs
r_cols = df.columns[1::2] # Right columns in the pairs
# Go through each pair
for l, r in zip(l_cols, r_cols):
df[l] = df[r].fillna(df[r])
The columns used for merging can be excluded from the output using:
df = df.drop(r_cols, axis=1)
Upvotes: 2
Reputation: 30605
Here is a solution using zip to match every two columns
li = zip(df.columns[0::2],df.columns[1::2])
#[('A', 'B'), ('C', 'D')]
# I assume columns are pairs and end up with lenght as odd number with additional column.
# If you want to ignore last column manually you can use
# li = zip(df.columns[0:-1:2],df.columns[1:-1:2]) # slice `start:end:step`
temp = pd.DataFrame({i :df[i]+df[j] for i,j in li})
ndf = pd.concat([temp,df['key']],1)
# A C key
# 0 A0 C1 K0
# 1 B1 C2 K1
# 2 A2 D3 K2
# 3 B3 D4 K3
Upvotes: 5