Reputation: 2581
I am trying to implement a for loop which loops over a dictionary. This dictionary has values extracted from a csv file. Some rows are missing values in some columns. My plan is to take the average of the previous entry and the nearest available next entry and assign it to the dictionary. Sometimes the same column is missing in several consecutive rows.
Here is an example:
input.csv:
Date,Column_1,Column_2,Column_3
2020-06-26,1,3,5
2020-06-27,2,,4
2020-06-28,5,,6
2020-06-29,7,8,10
The expected behaviour is:
output.csv:
Date,Column_1,Column_2,Column_3
2020-06-26,1,3,5
2020-06-27,2,5.5,4
2020-06-28,5,6.75,6
2020-06-29,7,8,10
(3 + 8) / 2 = 5.5
(5.5 + 8) / 2 = 6.75
Here is what I have tried:
def neighborhood(iterable):
    iterator = iter(iterable)
    previous_item = None
    current_item = next(iterator)
    for next_item in iterator:
        yield previous_item, current_item, next_item
        previous_item = current_item
        current_item = next_item
    yield previous_item, current_item, None

dictionary = {
    '2020-06-26': {'Date': '2020-06-26', 'Column_1': 1, 'Column_2': 3, 'Column_3': 5},
    '2020-06-27': {'Date': '2020-06-27', 'Column_1': 2, 'Column_3': 4},
    '2020-06-28': {'Date': '2020-06-28', 'Column_1': 5, 'Column_3': 6},
    '2020-06-29': {'Date': '2020-06-29', 'Column_1': 7, 'Column_2': 8, 'Column_3': 10}
}
field_names = {'Column_1', 'Column_2', 'Column_3'}

for previous_date, current_date, next_date in neighborhood(sorted(dictionary)):
    for field_name in field_names:
        if field_name not in dictionary[current_date]:
            dictionary[current_date][field_name] = (dictionary[previous_date][field_name] + dictionary[next_date][field_name]) / 2
Note: The question is not about how to read from or write to a csv file. There will be a dictionary with the data that I have extracted from the input csv file, and code after this snippet writes to the output csv file. The dictionary contains the date twice because, when reading from the input csv file, I do dictionary[row['Date']] = row. I could make it a list, but that would complicate the sorted function call. The first and the last rows are guaranteed to be filled completely, i.e. without missing column values. The dictionary key is a datetime object and not a string: when reading from the input csv file, I convert the string to a datetime object and use it as the dictionary key.
Upvotes: 3
Views: 902
Reputation: 536
Python is an object-friendly language, so a moderately complex problem like this one is a reasonable fit for a class.
Solution:
class frame(object):
    def __init__(self, data: "list of list of object" = None):
        self._data = data
        self._init_str_size()

    def _init_str_size(self):
        # Requires at least one non-empty row.
        assert self._data and self._data[0]
        # Each column is as wide as its longest string representation.
        self._str_size = [0 for _ in range(len(self._data[0]))]
        for row in self._data:
            for index, item in enumerate(row):
                self._str_size[index] = max(self._str_size[index], len(str(item)))

    def __str__(self):
        result = []
        for row in self._data:
            result.append(" | ".join([
                f"{str(item):>{self._str_size[index]}}" for index, item in enumerate(row)
            ]))
        return "\n".join(result)

    def _before_num(self, i, j):
        # Nearest non-missing value at or above row i in column j.
        return self._before_num(i - 1, j) if self._data[i][j] is None else self._data[i][j]

    def _next_num(self, i, j):
        # Nearest non-missing value at or below row i in column j.
        return self._next_num(i + 1, j) if self._data[i][j] is None else self._data[i][j]

    def fill_num(self):
        # Replace each None with the mean of its previous and next available values.
        for r in range(len(self._data)):
            for c in range(len(self._data[r])):
                if self._data[r][c] is None:
                    self._data[r][c] = (
                        (self._before_num(r, c) + self._next_num(r, c)) / 2
                    )
        self._init_str_size()
f = frame([
    ["2020-06-26", 1, 3.0, 5],
    ["2020-06-27", 2, None, 4],
    ["2020-06-28", 5, None, 6],
    ["2020-06-29", 7, 8.0, 10]
])
print(f)
# output:
# | 2020-06-26 | 1 |  3.0 |  5
# | 2020-06-27 | 2 | None |  4
# | 2020-06-28 | 5 | None |  6
# | 2020-06-29 | 7 |  8.0 | 10

# The previous and next available numbers for the value in row 2, column 3
print(f._before_num(2 - 1, 3 - 1))
print(f._next_num(2 - 1, 3 - 1))
# output:
# | 3.0
# | 8.0

f.fill_num()
print(f)
# output:
# | 2020-06-26 | 1 |  3.0 |  5
# | 2020-06-27 | 2 |  5.5 |  4
# | 2020-06-28 | 5 | 6.75 |  6
# | 2020-06-29 | 7 |  8.0 | 10
Most of the code is there to print the data in a readable format. The core logic that fills the gaps takes only a few lines. Enjoy!
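For comparison, the same fill rule can be applied directly to the dictionary-of-rows structure from the question. This is only a minimal sketch: the function name fill_missing is hypothetical, the inner dictionaries omit the repeated 'Date' entry for brevity, and it relies on the question's guarantee that the first and last rows are complete.

```python
def fill_missing(rows, field_names):
    """Fill gaps in place with the mean of the previous and the nearest next value.

    `rows` maps a sortable key (for example a date) to a dict of column values.
    Assumes the first and last rows are complete, as the question states.
    """
    keys = sorted(rows)
    for position, key in enumerate(keys):
        for field in field_names:
            if rows[key].get(field) is not None:
                continue
            # The previous row has this field by now: it was either present
            # originally or filled in an earlier iteration.
            previous_value = rows[keys[position - 1]][field]
            # Scan forward for the nearest row that has this field.
            next_value = next(
                rows[later][field]
                for later in keys[position + 1:]
                if rows[later].get(field) is not None
            )
            rows[key][field] = (previous_value + next_value) / 2
    return rows


data = {
    "2020-06-26": {"Column_1": 1, "Column_2": 3, "Column_3": 5},
    "2020-06-27": {"Column_1": 2, "Column_3": 4},
    "2020-06-28": {"Column_1": 5, "Column_3": 6},
    "2020-06-29": {"Column_1": 7, "Column_2": 8, "Column_3": 10},
}
fill_missing(data, ["Column_1", "Column_2", "Column_3"])
print(data["2020-06-27"]["Column_2"], data["2020-06-28"]["Column_2"])  # 5.5 6.75
```

Because sorted only needs comparable keys, the same sketch should also work when the keys are datetime objects rather than strings.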
Upvotes: 1
Reputation: 32224
Using Pandas you can use the interpolate() method.
import pandas as pd
df = pd.read_csv("input.csv")
The dataframe now looks like this:
         Date  Column_1  Column_2  Column_3
0  2020-06-26         1       3.0         5
1  2020-06-27         2       NaN         4
2  2020-06-28         5       NaN         6
3  2020-06-29         7       8.0        10
Using interpolate() on the column with missing data fills the gaps.
df['Column_2'].interpolate()
0    3.000000
1    4.666667
2    6.333333
3    8.000000
Name: Column_2, dtype: float64
Now we can assign that back into the dataframe
df['Column_2'] = df['Column_2'].interpolate()
Results in
         Date  Column_1  Column_2  Column_3
0  2020-06-26         1  3.000000         5
1  2020-06-27         2  4.666667         4
2  2020-06-28         5  6.333333         6
3  2020-06-29         7  8.000000        10
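Note that these values (4.666667 and 6.333333) differ from the 5.5 and 6.75 expected in the question. The default interpolate() method is linear and treats the values as equally spaced, while the question averages the previous (already filled) value with the next available one. The plain-Python sketch below contrasts the two rules on Column_2. Both function names are hypothetical, and linear_fill only illustrates the arithmetic; it is not how pandas is implemented.

```python
def linear_fill(values):
    """Fill None gaps on a straight line between the surrounding known values."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled


def running_average_fill(values):
    """Fill each None with the mean of the previous (filled) and next known value."""
    filled = list(values)
    for i, v in enumerate(filled):
        if v is None:
            # Assumes the first and last entries are known, as in the question.
            next_known = next(x for x in filled[i + 1:] if x is not None)
            filled[i] = (filled[i - 1] + next_known) / 2
    return filled


column_2 = [3, None, None, 8]
print(linear_fill(column_2))           # roughly 4.67 and 6.33 in the gaps
print(running_average_fill(column_2))  # [3, 5.5, 6.75, 8]
```

Which rule is appropriate depends on the data: linear interpolation spaces the filled values evenly, whereas the running average pulls each successive value closer to the next known entry.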
Upvotes: 2