Reputation: 48735
I have a DataFrame with a column whose data is dependent on the values in another column. Unfortunately, the source from which I am collecting the data only provides the value for the second column ('job_id') the first time the value for the first column ('host_id') is given. The result is that my 'job_id' column has a lot of NaN values.
In [1]: import pandas as pd, numpy as np
In [2]: df = pd.DataFrame({'run_id' : range(10),
...: 'host_id': ['a', 'b', 'c', 'd', 'e', 'a', 'd', 'c', 'a', 'e'],
...: 'job_id': [100253, 100254, 100255, 100256, 100257, np.nan, np.nan, np.nan, np.nan, np.nan]})
In [3]: df
Out[3]:
host_id job_id run_id
0 a 100253.0 0
1 b 100254.0 1
2 c 100255.0 2
3 d 100256.0 3
4 e 100257.0 4
5 a NaN 5
6 d NaN 6
7 c NaN 7
8 a NaN 8
9 e NaN 9
The desired output would be to have 'job_id' repeat in the same way as 'host_id':
host_id job_id run_id
0 a 100253.0 0
1 b 100254.0 1
2 c 100255.0 2
3 d 100256.0 3
4 e 100257.0 4
5 a 100253.0 5
6 d 100256.0 6
7 c 100255.0 7
8 a 100253.0 8
9 e 100257.0 9
The solution that I came up with is to extract just the 'host_id' and 'job_id' columns, drop the rows where 'job_id' is NaN, do a left merge back onto the original DataFrame, and then rename/reorder the resulting columns.
In [3]: host_job_mapping = df[['host_id', 'job_id']].dropna(subset=['job_id'])
In [4]: host_job_mapping
Out[4]:
host_id job_id
0 a 100253.0
1 b 100254.0
2 c 100255.0
3 d 100256.0
4 e 100257.0
In [5]: df = pd.merge(df, host_job_mapping, how='left', on='host_id')
In [6]: df
Out[6]:
host_id job_id_x run_id job_id_y
0 a 100253.0 0 100253.0
1 b 100254.0 1 100254.0
2 c 100255.0 2 100255.0
3 d 100256.0 3 100256.0
4 e 100257.0 4 100257.0
5 a NaN 5 100253.0
6 d NaN 6 100256.0
7 c NaN 7 100255.0
8 a NaN 8 100253.0
9 e NaN 9 100257.0
In [7]: df = df.rename(columns={'job_id_y': 'job_id'})[['host_id', 'job_id', 'run_id']]
In [8]: df
Out[8]:
host_id job_id run_id
0 a 100253.0 0
1 b 100254.0 1
2 c 100255.0 2
3 d 100256.0 3
4 e 100257.0 4
5 a 100253.0 5
6 d 100256.0 6
7 c 100255.0 7
8 a 100253.0 8
9 e 100257.0 9
While this works, it does not seem particularly elegant. Is there an easier or more straightforward way to achieve this (without resorting to apply)?
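For reference, the merge-based workaround above condenses into a short self-contained sketch (same toy data as in the transcript):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'run_id': range(10),
                   'host_id': ['a', 'b', 'c', 'd', 'e', 'a', 'd', 'c', 'a', 'e'],
                   'job_id': [100253, 100254, 100255, 100256, 100257,
                              np.nan, np.nan, np.nan, np.nan, np.nan]})

# Build the host -> job mapping from the rows where job_id is known,
# left-merge it back, then keep the filled column under its original name.
host_job_mapping = df[['host_id', 'job_id']].dropna(subset=['job_id'])
df = (df.merge(host_job_mapping, how='left', on='host_id')
        .rename(columns={'job_id_y': 'job_id'})[['host_id', 'job_id', 'run_id']])

print(df['job_id'].tolist())
```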
Upvotes: 1
Views: 239
Reputation: 214957
You can group by host_id and then do a forward fill:
df.groupby('host_id', as_index=False).ffill()
# note: newer pandas versions exclude the grouping column ('host_id') from
# the ffill result, so there you would assign the filled column(s) back instead
# host_id job_id run_id
#0 a 100253.0 0
#1 b 100254.0 1
#2 c 100255.0 2
#3 d 100256.0 3
#4 e 100257.0 4
#5 a 100253.0 5
#6 d 100256.0 6
#7 c 100255.0 7
#8 a 100253.0 8
#9 e 100257.0 9
If there might be missing values in other columns:
df['job_id'] = df.job_id.groupby(df.host_id).ffill()
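A quick sanity check on the column-scoped version, using a smaller made-up frame with a hypothetical `notes` column whose NaNs should be left untouched:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'host_id': ['a', 'b', 'a', 'b'],
                   'job_id': [1.0, 2.0, np.nan, np.nan],
                   'notes': ['x', np.nan, np.nan, 'y']})

# Forward-fill only job_id within each host_id group; notes keeps its NaNs.
df['job_id'] = df.job_id.groupby(df.host_id).ffill()
print(df['job_id'].tolist(), df['notes'].isna().tolist())
```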
Or following your original approach, first get the relation between host_id and job_id, and then use map to look up the job_id from the host_id:
df.job_id = df.host_id.map(df.set_index('host_id').job_id.dropna())
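Not in the original answer, but another equivalent one-liner is `groupby(...).transform('first')`, which broadcasts each group's first non-NaN job_id to all of that group's rows (the `'first'` aggregation skips NaNs), so it works regardless of where in the group the known value sits:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'run_id': range(10),
                   'host_id': ['a', 'b', 'c', 'd', 'e', 'a', 'd', 'c', 'a', 'e'],
                   'job_id': [100253, 100254, 100255, 100256, 100257,
                              np.nan, np.nan, np.nan, np.nan, np.nan]})

# Broadcast each host's first non-NaN job_id to all of its rows.
df['job_id'] = df.groupby('host_id')['job_id'].transform('first')
print(df['job_id'].tolist())
```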
Upvotes: 1