Reputation: 23
I have a performance problem with filling missing values in my dataset. This concerns a 500 MB / 5,000,000-row dataset (Kaggle: Expedia 2013).
It would be easiest to use df.fillna(), but it seems I cannot use it to fill every NaN with a different value. I created a lookup table:
srch_destination_id | Value
                  2 | 0.0110
                  3 | 0.0000
                  5 | 0.0207
                  7 | NaN
                  8 | NaN
                  9 | NaN
                 10 | 0.1500
                 12 | 0.0114
For each srch_destination_id, this table contains the value to replace NaN with in dataset.
import pandas as pd

# Iterate over dataset row by row. If the score is missing (NaN), fill in the min. val
# found in lookuptable for that row's srch_destination_id.
for row in range(len(dataset)):
    if pd.isnull(dataset.iloc[row]['prop_location_score2']):
        cell = dataset.iloc[row]['srch_destination_id']
        dataset.set_value(row, 'prop_location_score2', lookuptable.loc[cell])
This code works when iterating over 1000 rows, but when iterating over all 5 million rows, my computer never finishes (I waited hours).
Is there a better way to do what I'm doing? Did I make a mistake somewhere?
Upvotes: 2
Views: 178
Reputation: 164753
pd.Series.fillna accepts a Series or a dictionary, as well as scalar replacement values.
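For example, a quick sketch with made-up data, showing that the dictionary keys are matched against the Series index:

import pandas as pd

s = pd.Series([1.0, None, None, 4.0])
# Only the NaN positions are filled: index 1 gets 10.0 and index 2 gets 20.0.
print(s.fillna({1: 10.0, 2: 20.0}))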
You can therefore create a Series mapping from the lookup table:
s = lookuptable.set_index('srch_destination_id')['Value']
Then use this to fill in the NaN values in dataset:
dataset['prop_location_score2'] = dataset['prop_location_score2'].fillna(dataset['srch_destination_id'].map(s.get))
Notice that the input to fillna is built by mapping each srch_destination_id identifier from dataset to its replacement value, using pd.Series.map to perform the lookup.
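For reference, a minimal end-to-end sketch with made-up data (column names taken from the question):

import numpy as np
import pandas as pd

# Made-up data mirroring the question's structure
dataset = pd.DataFrame({
    'srch_destination_id': [2, 3, 5, 2, 10],
    'prop_location_score2': [0.5, np.nan, 0.3, np.nan, np.nan],
})
lookuptable = pd.DataFrame({
    'srch_destination_id': [2, 3, 5, 10],
    'Value': [0.0110, 0.0000, 0.0207, 0.1500],
})

# Build the mapping Series and fill only the missing scores
s = lookuptable.set_index('srch_destination_id')['Value']
dataset['prop_location_score2'] = dataset['prop_location_score2'].fillna(
    dataset['srch_destination_id'].map(s.get))
print(dataset)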
Upvotes: 2