licm
licm

Reputation: 99

filling nan values in a table by interpolation between two values in a column and calculating the median, respectively

I have a pandas DataFrame looking as follows:

| ID | x | y   | z   |
| -- | - | --- | --- |
|  1 | 0 | nan | 36  |
|  1 | 1 | 12  | nan |
|  1 | 2 | nan | 38  |
|  1 | 3 | 11  | 37  |
|  2 | 0 | nan | 37  |
|  2 | 1 | nan | 37  |
|  2 | 2 | nan | nan |
|  2 | 3 | nan | nan |

I now want to fill the nan values for each ID in the following way:

  1. if values for a given ID exist, interpolate between the subsequent values (i.e.: When looking at ID 1: The value of z (in row x1) is what I'm looking for. I have z values for x0, x2 and x3, but the z value corresponding to x1 is missing. I Thus want to find a value for z (in the row of x1) by interpolating between the z values in rows x0 and x2.
  2. if no values are given for an ID (i.e.: all y values for ID 2 are nan), I want to calculate the median across the entire column (i.e.: across all y values of all IDs) and fill the nan values with that median number.

The result should be a pandas DataFrame in which all nan values are filled by the scheme as explained above. However, I am a beginner with pandas and don't know how to go about this problem to get the full DataFrame.

Upvotes: 1

Views: 678

Answers (1)

jezrael
jezrael

Reputation: 862581

Use Series.interpolate per groups for columns in list first, and then replace missing values by median in DataFrame.fillna:

cols = ['y','z']
median = df[cols].median()

df[cols] = (df.groupby('ID')[cols].transform(lambda x: x.interpolate())
              .fillna(median))

print (df)
   ID  x     y     z
0   1  0  11.5  36.0
1   1  1  12.0  37.0
2   1  2  11.5  38.0
3   1  3  11.0  37.0
4   2  0  11.5  37.0
5   2  1  11.5  37.0
6   2  2  11.5  37.0
7   2  3  11.5  37.0

Also if need specify limit_direction in interpolation use:

cols = ['y','z']
median = df[cols].median()

df[cols]= df.groupby('ID')[cols].transform(lambda x: x.interpolate(limit_direction='both'))
            .fillna(median))

print (df)
   ID  x     y     z
0   1  0  12.0  36.0
1   1  1  12.0  37.0
2   1  2  11.5  38.0
3   1  3  11.0  37.0
4   2  0  11.5  37.0
5   2  1  11.5  37.0
6   2  2  11.5  37.0
7   2  3  11.5  37.0

Upvotes: 1

Related Questions