Ambitions

Reputation: 2581

Interpolation of missing values

I am trying to implement a for loop that iterates over a dictionary whose values were extracted from a CSV file. Some rows are missing values in some columns. I would like to fill each missing value with the average of the previous entry and the nearest available next entry. Sometimes the same column is missing in several consecutive rows.

Here is an example:

input.csv:

Date,Column_1,Column_2,Column_3
2020-06-26,1,3,5
2020-06-27,2,,4
2020-06-28,5,,6
2020-06-29,7,8,10

The expected behaviour is:

output.csv:

Date,Column_1,Column_2,Column_3
2020-06-26,1,3,5
2020-06-27,2,5.5,4
2020-06-28,5,6.75,6
2020-06-29,7,8,10

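Column_2 for 2020-06-27: (3 + 8) / 2 = 5.5

Column_2 for 2020-06-28, using the value just filled in as the previous entry: (5.5 + 8) / 2 = 6.75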

Here is what I have tried:

def neighborhood(iterable):
    iterator = iter(iterable)
    previous_item = None
    current_item = next(iterator)
    for next_item in iterator:
        yield previous_item, current_item, next_item
        previous_item = current_item
        current_item = next_item
    yield previous_item, current_item, None

dictionary = {
    '2020-06-26': {'Date': '2020-06-26', 'Column_1': 1, 'Column_2': 3, 'Column_3': 5},
    '2020-06-27': {'Date': '2020-06-27', 'Column_1': 2, 'Column_3': 4},
    '2020-06-28': {'Date': '2020-06-28', 'Column_1': 5, 'Column_3': 6},
    '2020-06-29': {'Date': '2020-06-29', 'Column_1': 7, 'Column_2': 8, 'Column_3': 10}
}

field_names = {'Column_1', 'Column_2', 'Column_3'}

for previous_date, current_date, next_date in neighborhood(sorted(dictionary)):
    for field_name in field_names:
        if field_name not in dictionary[current_date]:
            dictionary[current_date][field_name] = (dictionary[previous_date][field_name] + dictionary[next_date][field_name]) / 2

Note: The question is not about reading from or writing to a CSV file. The dictionary holds the data I extracted from the input CSV file, and code after this snippet writes the output CSV file. The date appears twice in each entry because I populate the dictionary with dictionary[row['Date']] = row while reading the input file; I could use a list instead, but that would complicate the sorted function call. The first and last rows are guaranteed to be complete, i.e. without missing column values. In my actual code the dictionary key is a datetime object rather than a string: I convert the date string to a datetime object while reading the input file and use it as the key.

Upvotes: 3

Views: 902

Answers (2)

Peterlits Zo

Reputation: 536

Python is an object-friendly language, so a moderately complex problem like this one is a good fit for a class.

Solution:

class frame(object):
    def __init__(self, data:"list of list of object"= None):
        self._data = data
        self._init_str_size()

    def _init_str_size(self):
        # Assert that there is at least one non-empty row.
        assert self._data and self._data[0]
        # Track the widest string representation in each column.
        self._str_size = [0] * len(self._data[0])
        for row in self._data:
            for index, item in enumerate(row):
                self._str_size[index] = max(self._str_size[index], len(str(item)))

    def __str__(self):
        result = []
        for row in self._data:
            result.append(" | ".join([
                f"{str(item):>{self._str_size[index]}}" for index, item in enumerate(row)
            ]))
        return "\n".join(result)

    def _before_num(self, i, j):
        # Walk upwards until a row with a value in column j is found.
        return self._before_num(i - 1, j) if self._data[i][j] is None else self._data[i][j]

    def _next_num(self, i, j):
        # Walk downwards until a row with a value in column j is found.
        return self._next_num(i + 1, j) if self._data[i][j] is None else self._data[i][j]

    def fill_num(self):
        # Rows are filled top to bottom, so the value above a gap is already filled in.
        for r in range(len(self._data)):
            for c in range(len(self._data[r])):
                if self._data[r][c] is None:
                    self._data[r][c] = (
                        (self._before_num(r, c) + self._next_num(r, c)) / 2
                    )
        self._init_str_size()

f = frame([
    ["2020-06-26", 1,  3.0,  5],
    ["2020-06-27", 2, None,  4],
    ["2020-06-28", 5, None,  6],
    ["2020-06-29", 7,  8.0, 10]
])

print(f)
# output:
#   | 2020-06-26 | 1 |  3.0 |  5
#   | 2020-06-27 | 2 | None |  4
#   | 2020-06-28 | 5 | None |  6
#   | 2020-06-29 | 7 |  8.0 | 10

# the previous and next available values for row 2, column 3 (1-based)
print(f._before_num(2 - 1, 3 - 1))
print(f._next_num(2 - 1, 3 - 1))
# output:
#   | 3.0
#   | 8.0

f.fill_num()
print(f)
# output:
#   | 2020-06-26 | 1 |  3.0 |  5
#   | 2020-06-27 | 2 |  5.5 |  4
#   | 2020-06-28 | 5 | 6.75 |  6
#   | 2020-06-29 | 7 |  8.0 | 10

Most of this code exists only to print the data in a readable format; the core logic that fills the gaps takes just a few lines. Enjoy!
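The same rule can also be applied directly to the dictionary from the question, without a wrapper class. The loop in the question looks only one row ahead, so it raises a KeyError when the next row is missing the same field. The following is a minimal sketch, not a definitive implementation: the helper name fill_missing is hypothetical, and it assumes that the keys sort chronologically and that the first and last rows are complete, as the question states.

```python
def fill_missing(rows, field_names):
    """Fill gaps in place with the average of the previous value
    (already filled) and the nearest later value that is present."""
    keys = sorted(rows)
    for i, key in enumerate(keys):
        for name in field_names:
            if rows[key].get(name) is not None:
                continue  # value is present, nothing to do
            # Assumption: the first row is complete, so i >= 1 here.
            previous_value = rows[keys[i - 1]][name]
            # Assumption: the last row is complete, so a later value exists.
            next_value = next(
                rows[k][name] for k in keys[i + 1:]
                if rows[k].get(name) is not None
            )
            rows[key][name] = (previous_value + next_value) / 2
    return rows


rows = {
    '2020-06-26': {'Date': '2020-06-26', 'Column_1': 1, 'Column_2': 3, 'Column_3': 5},
    '2020-06-27': {'Date': '2020-06-27', 'Column_1': 2, 'Column_3': 4},
    '2020-06-28': {'Date': '2020-06-28', 'Column_1': 5, 'Column_3': 6},
    '2020-06-29': {'Date': '2020-06-29', 'Column_1': 7, 'Column_2': 8, 'Column_3': 10},
}

fill_missing(rows, ['Column_1', 'Column_2', 'Column_3'])
print(rows['2020-06-27']['Column_2'])  # 5.5
print(rows['2020-06-28']['Column_2'])  # 6.75
```

Because the rows are visited in sorted order, the value for 2020-06-27 is already filled in by the time 2020-06-28 is processed, which is what produces 6.75 rather than 5.5 for the second gap.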

Upvotes: 1

firelynx

Reputation: 32224

With pandas, you can use the interpolate() method.

import pandas as pd

df = pd.read_csv("input.csv")

The dataframe now looks like this:

         Date  Column_1  Column_2  Column_3
0  2020-06-26         1       3.0         5
1  2020-06-27         2       NaN         4
2  2020-06-28         5       NaN         6
3  2020-06-29         7       8.0        10

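Calling interpolate() on the column with missing data fills the gaps. The default method is linear, so the result differs from the running averages (5.5 and 6.75) expected in the question.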

df['Column_2'].interpolate()
0    3.000000
1    4.666667
2    6.333333
3    8.000000
Name: Column_2, dtype: float64
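To see where 4.666667 and 6.333333 come from, here is a minimal pure-Python sketch of linear interpolation over equally spaced rows, which is how pandas' default method treats the values. The helper name linear_fill is hypothetical and is not part of pandas.

```python
def linear_fill(values):
    """Fill None gaps by drawing a straight line between known neighbours."""
    filled = list(values)
    known = [i for i, v in enumerate(filled) if v is not None]
    for left, right in zip(known, known[1:]):
        # Constant increment per row between the two known values.
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled


result = linear_fill([3.0, None, None, 8.0])
print([round(v, 6) for v in result])
```

The gap between 3 and 8 spans three steps, so each row increases by 5 / 3, giving roughly 4.666667 and 6.333333.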

Now we can assign that back into the dataframe:

df['Column_2'] = df['Column_2'].interpolate()

This results in:

         Date  Column_1  Column_2  Column_3
0  2020-06-26         1  3.000000         5
1  2020-06-27         2  4.666667         4
2  2020-06-28         5  6.333333         6
3  2020-06-29         7  8.000000        10
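If the exact values from the question (5.5 and 6.75) are required, linear interpolation will not produce them. The sketch below is one possible way to apply the question's rule to a pandas column. The helper name fill_running_average is hypothetical, the CSV text is inlined so the example is self-contained, and it assumes the first and last rows of the column are present.

```python
import io

import pandas as pd

csv_text = """Date,Column_1,Column_2,Column_3
2020-06-26,1,3,5
2020-06-27,2,,4
2020-06-28,5,,6
2020-06-29,7,8,10
"""
df = pd.read_csv(io.StringIO(csv_text))


def fill_running_average(series):
    """Replace each NaN with the average of the previous (already filled)
    value and the nearest later non-missing value."""
    next_valid = series.bfill()  # nearest later non-missing value per row
    values = series.tolist()
    for i, value in enumerate(values):
        if pd.isna(value):
            # Assumption: the first entry is present, so i >= 1 here.
            values[i] = (values[i - 1] + next_valid.iloc[i]) / 2
    return pd.Series(values, index=series.index, name=series.name)


df["Column_2"] = fill_running_average(df["Column_2"])
print(df["Column_2"].tolist())  # [3.0, 5.5, 6.75, 8.0]
```

The explicit loop is needed because each filled value depends on the one filled just before it, which a single vectorised call does not express.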

Upvotes: 2
