Arun

Reputation: 2478

fancyimpute Python 3 MemoryError

I have a CSV file containing lots of missing values. I am trying to use the fancyimpute package to impute the missing values using its KNN() method.

The pandas DataFrame built from the CSV file has 7 attributes/columns; an 8th attribute, 'time', is used as the index of the DataFrame.

data.shape

# (83070, 7)

data.isnull().sum().sum()

# 59926

data.isnull().sum()

'''
A        171
B        0
C        0
D        47441
E        170
F        12144
G        0
dtype: int64
'''

When I use the following code for data imputation:

filled_data_na = KNN(k = 3).fit_transform(data)

It gives me the following error:

MemoryError                               Traceback (most recent call last)
in
----> 1 filled_na = KNN(k = 3).fit_transform(data_date_idx)

~/.local/lib/python3.6/site-packages/fancyimpute/solver.py in fit_transform(self, X, y)
    187                     type(X_filled)))
    188
--> 189         X_result = self.solve(X_filled, missing_mask)
    190         if not isinstance(X_result, np.ndarray):
    191             raise TypeError(

~/.local/lib/python3.6/site-packages/fancyimpute/knn.py in solve(self, X, missing_mask)
    102             k=self.k,
    103             verbose=self.verbose,
--> 104             print_interval=self.print_interval)
    105
    106         failed_to_impute = np.isnan(X_imputed)

~/.local/lib/python3.6/site-packages/knnimpute/few_observed_entries.py in knn_impute_few_observed(X, missing_mask, k, verbose, print_interval)
     49     X_column_major = X.copy(order="F")
     50     X_row_major, D, effective_infinity = \
---> 51         knn_initialize(X, missing_mask, verbose=verbose)
     52     # get rid of infinities, replace them with a very large number
     53     D_sorted = np.argsort(D, axis=1)

~/.local/lib/python3.6/site-packages/knnimpute/common.py in knn_initialize(X, missing_mask, verbose, min_dist, max_dist_multiplier)
     37     # to put NaN's back in the data matrix for the distances function
     38     X_row_major[missing_mask] = np.nan
---> 39     D = all_pairs_normalized_distances(X_row_major)
     40     D_finite_flat = D[np.isfinite(D)]
     41     if len(D_finite_flat) > 0:

~/.local/lib/python3.6/site-packages/knnimpute/normalized_distance.py in all_pairs_normalized_distances(X)
     36
     37     # matrix of mean squared difference between between samples
---> 38     D = np.ones((n_rows, n_rows), dtype="float32", order="C") * np.inf
     39
     40     # we can cheaply determine the number of columns that two rows share

~/.local/lib/python3.6/site-packages/numpy/core/numeric.py in ones(shape, dtype, order)
    221
    222     """
--> 223     a = empty(shape, dtype, order)
    224     multiarray.copyto(a, 1, casting='unsafe')
    225     return a

MemoryError:
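For scale, the traceback shows knnimpute allocating an n_rows × n_rows float32 distance matrix. A quick back-of-the-envelope check (a sketch, using the row count from data.shape above) shows why that allocation alone fails on a typical machine:

```python
# knnimpute allocates np.ones((n_rows, n_rows), dtype="float32") for the
# pairwise distance matrix; estimate its size for this dataset.
n_rows = 83070
bytes_needed = n_rows * n_rows * 4  # 4 bytes per float32 entry
print(f"{bytes_needed / 2**30:.1f} GiB")  # roughly 25.7 GiB
```

So the MemoryError is expected for a frame of this size regardless of how many values are missing.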

Any ideas as to what's going wrong?

Thanks!

Upvotes: 1

Views: 598

Answers (1)

tzujan

Reputation: 186

I am not that familiar with fancyimpute; however, iterating with pandas' chunksize parameter can solve memory-related problems. Basically, passing chunksize to pd.read_csv gives you a TextFileReader object that you can iterate over.

for chunk in pd.read_csv('my_csv.csv', chunksize=1000):
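A minimal sketch of that pattern (the CSV contents and the per-chunk fill are stand-ins; with fancyimpute you would call KNN(k=3).fit_transform(chunk) instead, noting that neighbors are then only searched within each chunk, so the distance matrix shrinks to chunksize × chunksize):

```python
import io
import pandas as pd

# Hypothetical in-memory CSV standing in for 'my_csv.csv'.
csv_data = io.StringIO("A,B\n1.0,2.0\n,3.0\n4.0,\n5.0,6.0\n")

filled_chunks = []
for chunk in pd.read_csv(csv_data, chunksize=2):
    # Stand-in imputer: fill each chunk's NaNs with that chunk's column means.
    filled_chunks.append(chunk.fillna(chunk.mean()))

# Reassemble the imputed chunks into one DataFrame.
filled = pd.concat(filled_chunks)
```

The trade-off is that each chunk is imputed in isolation, so results can differ from imputing the whole frame at once.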

Another option that could work is to import the data into 7 separate pd.Series, run your imputation on each column, and then concat (axis=1) to rebuild the DataFrame.
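A small sketch of the column-wise approach (the toy frame and the mean-fill are stand-ins for the real 7-column data and imputer):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the 7-column data in the question.
data = pd.DataFrame({
    "A": [1.0, np.nan, 3.0],
    "D": [np.nan, 2.0, np.nan],
})

# Impute each column independently (here: fill with the column mean),
# then reassemble the columns with concat along axis=1.
filled_cols = [data[col].fillna(data[col].mean()) for col in data.columns]
filled = pd.concat(filled_cols, axis=1)
```

Note that a per-column imputer cannot use cross-column information the way KNN does, so this changes the imputation model, not just the memory profile.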

Upvotes: 1
