Interpolation of missing values

Question

I am trying to implement a for loop that iterates over a dictionary whose values were extracted from a CSV file. Some rows are missing values in some columns. My plan is to fill each missing value with the average of the previous entry and the nearest available next entry. Sometimes the same column is missing in several consecutive rows.

Here is an example:

input.csv:

Date,Column_1,Column_2,Column_3
2020-06-26,1,3,5
2020-06-27,2,,4
2020-06-28,5,,6
2020-06-29,7,8,10

The expected behaviour is:

output.csv:

Date,Column_1,Column_2,Column_3
2020-06-26,1,3,5
2020-06-27,2,5.5,4
2020-06-28,5,6.75,6
2020-06-29,7,8,10

(3 + 8) / 2 = 5.5

(5.5 + 8) / 2 = 6.75

Here is what I have tried:

def neighborhood(iterable):
    iterator = iter(iterable)
    previous_item = None
    current_item = next(iterator)
    for next_item in iterator:
        yield previous_item, current_item, next_item
        previous_item = current_item
        current_item = next_item
    yield previous_item, current_item, None

dictionary = {
    '2020-06-26': {'Date': '2020-06-26', 'Column_1': 1, 'Column_2': 3, 'Column_3': 5},
    '2020-06-27': {'Date': '2020-06-27', 'Column_1': 2, 'Column_3': 4},
    '2020-06-28': {'Date': '2020-06-28', 'Column_1': 5, 'Column_3': 6},
    '2020-06-29': {'Date': '2020-06-29', 'Column_1': 7, 'Column_2': 8, 'Column_3': 10}
}

field_names = {'Column_1', 'Column_2', 'Column_3'}

for previous_date, current_date, next_date in neighborhood(sorted(dictionary)):
    for field_name in field_names:
        if field_name not in dictionary[current_date]:
            dictionary[current_date][field_name] = (dictionary[previous_date][field_name] + dictionary[next_date][field_name]) / 2

Note: This question is not about reading from or writing to a CSV file. The dictionary holds data I have already extracted from the input CSV file, and code after this snippet writes it to the output CSV file. The date appears twice in the dictionary because I populate it with dictionary[row['Date']] = row while reading the input file; I could use a list instead, but that would complicate the sorted function call. The first and last rows are guaranteed to be complete, i.e. to have no missing column values. In my real code the dictionary key is a datetime object, not a string: I convert the string to a datetime object when reading the input CSV file and use it as the key.

firelynx · Accepted Answer

With pandas, you can use the interpolate() method.

import pandas as pd

df = pd.read_csv("input.csv")

The dataframe now looks like this:

         Date  Column_1  Column_2  Column_3
0  2020-06-26         1       3.0         5
1  2020-06-27         2       NaN         4
2  2020-06-28         5       NaN         6
3  2020-06-29         7       8.0        10

Calling interpolate() on the column with missing data fills the gaps (linear interpolation by default):

df['Column_2'].interpolate()
0    3.000000
1    4.666667
2    6.333333
3    8.000000
Name: Column_2, dtype: float64
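
The values 4.666667 and 6.333333 come from linear interpolation: the two gaps are filled with evenly spaced steps between the known values 3 and 8. The following is only a rough pure-Python sketch of that arithmetic, assuming equally spaced rows and known first and last values. linear_fill is a hypothetical helper written for illustration, not part of pandas.

```python
def linear_fill(values):
    """Fill None gaps with evenly spaced values between the nearest known neighbours.

    Assumption: rows are equally spaced and the first and last values are known.
    """
    filled = list(values)
    # Positions that already hold a value.
    known = [i for i, v in enumerate(filled) if v is not None]
    # Walk over each pair of consecutive known positions and fill the gap between them.
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled


column_2 = [3, None, None, 8]
result = linear_fill(column_2)
print(result)  # roughly 3, 4.67, 6.33, 8
```

Here the step is (8 - 3) / 3, so the filled values are 3 + 5/3 and 3 + 10/3.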

Now we can assign the result back into the dataframe:

df['Column_2'] = df['Column_2'].interpolate()

This results in:

         Date  Column_1  Column_2  Column_3
0  2020-06-26         1  3.000000         5
1  2020-06-27         2  4.666667         4
2  2020-06-28         5  6.333333         6
3  2020-06-29         7  8.000000        10
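
Note that the interpolated values in the table above (4.666667 and 6.333333) differ from the 5.5 and 6.75 requested in the question, because the question averages the previous (possibly already filled) value with the nearest available later value. If that exact rule is required, one possible plain-Python sketch on the question's dictionary follows. fill_missing is a hypothetical helper name, and the code relies on the question's guarantee that the first and last rows are complete.

```python
def fill_missing(rows, field_names):
    """Fill gaps in place with (previous value + nearest later value) / 2.

    Assumption (from the question): the first and last rows have no missing values.
    """
    keys = sorted(rows)
    for i, key in enumerate(keys):
        for field in field_names:
            if field in rows[key]:
                continue
            # The previous row is complete by now: original or filled in an earlier pass.
            previous_value = rows[keys[i - 1]][field]
            # Look ahead for the nearest later row that has this field.
            next_value = next(
                rows[k][field] for k in keys[i + 1:] if field in rows[k]
            )
            rows[key][field] = (previous_value + next_value) / 2
    return rows


dictionary = {
    '2020-06-26': {'Date': '2020-06-26', 'Column_1': 1, 'Column_2': 3, 'Column_3': 5},
    '2020-06-27': {'Date': '2020-06-27', 'Column_1': 2, 'Column_3': 4},
    '2020-06-28': {'Date': '2020-06-28', 'Column_1': 5, 'Column_3': 6},
    '2020-06-29': {'Date': '2020-06-29', 'Column_1': 7, 'Column_2': 8, 'Column_3': 10},
}
field_names = ['Column_1', 'Column_2', 'Column_3']

fill_missing(dictionary, field_names)
print(dictionary['2020-06-27']['Column_2'])  # 5.5
print(dictionary['2020-06-28']['Column_2'])  # 6.75
```

The same function should work unchanged with datetime keys, since it only relies on the keys being sortable.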
