Reputation: 23
I have a performance problem with filling missing values in my dataset. This concerns a 500 MB / 5,000,000-row dataset (Kaggle: Expedia 2013).
It would be easiest to use df.fillna(), but it seems I cannot use it to fill every NaN with a different value. I created a lookup table:
srch_destination_id | Value
                  2 | 0.0110
                  3 | 0.0000
                  5 | 0.0207
                  7 | NaN
                  8 | NaN
                  9 | NaN
                 10 | 0.1500
                 12 | 0.0114
For each srch_destination_id, this table contains the value to replace NaN with in dataset.
import pandas as pd

# Iterate over dataset row by row. If the score is missing (NaN), fill in the min. val
# found in lookuptable for that row's srch_destination_id.
for row in range(len(dataset)):
    if pd.isnull(dataset.iloc[row]['prop_location_score2']):
        cell = dataset.iloc[row]['srch_destination_id']
        dataset.set_value(row, 'prop_location_score2', lookuptable.loc[cell])
This code works when iterating over 1000 rows, but when iterating over all 5 million rows, my computer never finishes (I waited hours).
Is there a better way to do what I'm doing? Did I make a mistake somewhere?
Upvotes: 2
Views: 178
Reputation: 164753
pd.Series.fillna accepts a Series or a dictionary, as well as scalar replacement values.
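For example, a quick sketch with made-up data, showing that the dictionary keys are matched against the Series index:

import pandas as pd

s = pd.Series([1.0, None, None, 4.0])
# Only the NaN positions are filled: index 1 gets 10.0 and index 2 gets 20.0.
print(s.fillna({1: 10.0, 2: 20.0}))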
You can therefore create a Series mapping from the lookup table:
s = lookuptable.set_index('srch_destination_id')['Value']
Then use this to fill in the NaN values in dataset:
dataset['prop_location_score2'] = dataset['prop_location_score2'].fillna(dataset['srch_destination_id'].map(s.get))
Notice that the input to fillna is built by mapping each srch_destination_id identifier from dataset to its replacement value, using pd.Series.map to perform the lookup.
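For reference, a minimal end-to-end sketch with made-up data (column names taken from the question):

import numpy as np
import pandas as pd

# Made-up data mirroring the question's structure
dataset = pd.DataFrame({
    'srch_destination_id': [2, 3, 5, 2, 10],
    'prop_location_score2': [0.5, np.nan, 0.3, np.nan, np.nan],
})
lookuptable = pd.DataFrame({
    'srch_destination_id': [2, 3, 5, 10],
    'Value': [0.0110, 0.0000, 0.0207, 0.1500],
})

# Build the mapping Series and fill only the missing scores
s = lookuptable.set_index('srch_destination_id')['Value']
dataset['prop_location_score2'] = dataset['prop_location_score2'].fillna(
    dataset['srch_destination_id'].map(s.get))
print(dataset)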
Upvotes: 2