Jdoe
Jdoe

Reputation: 111

Interpolation in Python dataframe

I am trying to perform a linear interpolation in Python such that the interpolated values are calculated off certain values in a certain column.

In my example below, I want to interpolate Measurements for Products "a", and "b" using the actual numerical values from the relevant cells in Height column in the interpolation. Could I call the Height column the index for the interpolation?

When I am interpolating the missing Measurements for Product "a", the index values to use in interpolation are 4, 5 and 7. For interpolation of Product "b", index values to use are 1, 2.2 and 3.

I know about dataframe.interpolate() but I am struggling to modify my code to use the correct index values.

Here is the code for the dataframe I am starting with:

import pandas as pd
testdata1 = [('Product', ['a', 'a', 'a', 'b', 'b', 'b','b']),
     ('Height', ['4', '5', '7', '1',  '2.2', '3','4']),
     ('Measurement', ['35.00', '', '55.00','10.00','','30.00','40.00']),
     ]
df = pd.DataFrame.from_items(testdata1)
df

testdata1

And here is the code for the dataframe I need:

targetdf = [('Product', ['a', 'a', 'a', 'b', 'b', 'b','b']),
     ('Height', ['4', '5', '7', '1',  '2.2', '3','4']),
     ('Measurement', ['35.00', '41.67', '55.00','10.00','22.00','30.00','40.00']),
     ]
df2 = pd.DataFrame.from_items(targetdf)
df2

enter image description here

If this cannot be done with a dataframe, I am open to other ideas to do it in Python.

Any help is greatly appreciated. I am new to Python. Thank you.

Upvotes: 1

Views: 1539

Answers (1)

jezrael
jezrael

Reputation: 862481

You can use first some data cleaning:

#convert '' to `NaN`s and all strings to numeric
df['Measurement'] = pd.to_numeric(df['Measurement'], errors='coerce')
#convert strings to numeric
df['Height'] =  df['Height'].astype(float)

Then set_index, groupby and apply custom function with interpolate, last reset_index and get original order of columns by reindex:

df1 = df.set_index('Height') \
        .groupby('Product') \
        .apply(lambda x: x.interpolate(method='index')) \
        .reset_index() \
        .reindex(columns=df.columns)
print (df1)
  Product  Height  Measurement
0       a     4.0    35.000000
1       a     5.0    41.666667
2       a     7.0    55.000000
3       b     1.0    10.000000
4       b     2.2    22.000000
5       b     3.0    30.000000
6       b     4.0    40.000000

Upvotes: 2

Related Questions