SethMMorton

Reputation: 48735

Is there a simple DataFrame method to copy values in a column row-wise based on values in another column in another row?

I have a DataFrame with a column whose data depends on the values in another column. Unfortunately, the source from which I am collecting the data only provides the value for the second column ('job_id') the first time the value for the first column ('host_id') appears. The result is that my 'job_id' column has a lot of NaN values.

In [1]: import pandas as pd, numpy as np

In [2]: df = pd.DataFrame({'run_id' : range(10),
   ...:                    'host_id': ['a', 'b', 'c', 'd', 'e', 'a', 'd', 'c', 'a', 'e'],
   ...:                    'job_id': [100253, 100254, 100255, 100256, 100257, np.nan, np.nan, np.nan, np.nan, np.nan]})

In [3]: df
Out[3]: 
  host_id    job_id  run_id
0       a  100253.0       0
1       b  100254.0       1
2       c  100255.0       2
3       d  100256.0       3
4       e  100257.0       4
5       a       NaN       5
6       d       NaN       6
7       c       NaN       7
8       a       NaN       8
9       e       NaN       9

The desired output would be to have 'job_id' repeat in the same way as 'host_id':

  host_id    job_id  run_id
0       a  100253.0       0
1       b  100254.0       1
2       c  100255.0       2
3       d  100256.0       3
4       e  100257.0       4
5       a  100253.0       5
6       d  100256.0       6
7       c  100255.0       7
8       a  100253.0       8
9       e  100257.0       9

The solution I came up with is to extract just the 'host_id' and 'job_id' columns, drop the rows whose 'job_id' is NaN, left-merge the result back onto the original DataFrame, and then rename and reorder the resulting columns.

In [3]: host_job_mapping = df[['host_id', 'job_id']].dropna(subset=['job_id'])

In [4]: host_job_mapping
Out[4]: 
  host_id    job_id
0       a  100253.0
1       b  100254.0
2       c  100255.0
3       d  100256.0
4       e  100257.0

In [5]: df = pd.merge(df, host_job_mapping, how='left', on='host_id')

In [6]: df
Out[6]: 
  host_id  job_id_x  run_id  job_id_y
0       a  100253.0       0  100253.0
1       b  100254.0       1  100254.0
2       c  100255.0       2  100255.0
3       d  100256.0       3  100256.0
4       e  100257.0       4  100257.0
5       a       NaN       5  100253.0
6       d       NaN       6  100256.0
7       c       NaN       7  100255.0
8       a       NaN       8  100253.0
9       e       NaN       9  100257.0

In [7]: df = df.rename(columns={'job_id_y': 'job_id'})[['host_id', 'job_id', 'run_id']]

In [8]: df
Out[8]: 
  host_id    job_id  run_id
0       a  100253.0       0
1       b  100254.0       1
2       c  100255.0       2
3       d  100256.0       3
4       e  100257.0       4
5       a  100253.0       5
6       d  100256.0       6
7       c  100255.0       7
8       a  100253.0       8
9       e  100257.0       9
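One small variation on the merge approach above (a sketch, not necessarily better): dropping the sparse 'job_id' column before merging avoids the `job_id_x`/`job_id_y` suffixes entirely, so no rename step is needed afterwards.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'run_id': range(10),
                   'host_id': ['a', 'b', 'c', 'd', 'e', 'a', 'd', 'c', 'a', 'e'],
                   'job_id': [100253, 100254, 100255, 100256, 100257] + [np.nan] * 5})

# Build the host -> job mapping from the rows that do have a job_id.
host_job_mapping = df[['host_id', 'job_id']].dropna(subset=['job_id'])

# Dropping the sparse column first means the merged 'job_id' is the only
# column with that name, so pandas adds no _x/_y suffixes.
df = df.drop(columns='job_id').merge(host_job_mapping, how='left', on='host_id')
```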

While this works, it does not seem particularly elegant. Is there an easier or more straightforward way to achieve this (without resorting to apply)?

Upvotes: 1

Views: 239

Answers (1)

akuiper

Reputation: 214957

You can group by host_id and then do a forward fill:

df.groupby('host_id', as_index=False).ffill()

#  host_id    job_id  run_id
#0       a  100253.0       0
#1       b  100254.0       1
#2       c  100255.0       2
#3       d  100256.0       3
#4       e  100257.0       4
#5       a  100253.0       5
#6       d  100256.0       6
#7       c  100255.0       7
#8       a  100253.0       8
#9       e  100257.0       9
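A caveat worth hedging: in recent pandas versions, `groupby(...).ffill()` excludes the grouping column from its result, so the one-liner above may return a frame without 'host_id'. Selecting the target column and assigning it back is more portable across versions (a sketch of the same idea):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'run_id': range(10),
                   'host_id': ['a', 'b', 'c', 'd', 'e', 'a', 'd', 'c', 'a', 'e'],
                   'job_id': [100253, 100254, 100255, 100256, 100257] + [np.nan] * 5})

# Forward-fill job_id within each host_id group and assign it back,
# leaving the rest of the frame untouched.
df['job_id'] = df.groupby('host_id')['job_id'].ffill()
```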

If there might be missing values in other columns that should not be filled, restrict the operation to job_id:

df['job_id'] = df.job_id.groupby(df.host_id).ffill()

Or, following your original approach, first build the relation between host_id and job_id and then use map to look up the job_id for each host_id:

df.job_id = df.host_id.map(df.set_index('host_id').job_id.dropna())
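One guard worth adding to the map approach (an assumption about messier data, not part of the original answer): if a host_id ever appeared with more than one non-NaN job_id, the mapping Series would have a duplicated index and `map` would fail. `drop_duplicates` keeps only the first job_id per host:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'run_id': range(10),
                   'host_id': ['a', 'b', 'c', 'd', 'e', 'a', 'd', 'c', 'a', 'e'],
                   'job_id': [100253, 100254, 100255, 100256, 100257] + [np.nan] * 5})

# Series mapping each host_id to its first non-NaN job_id; drop_duplicates
# guarantees a unique index even if a host appears twice with a job_id.
mapping = (df.dropna(subset=['job_id'])
             .drop_duplicates('host_id')
             .set_index('host_id')['job_id'])

df['job_id'] = df['host_id'].map(mapping)
```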

Upvotes: 1
