Bn.F76

Reputation: 1013

How do I correctly impute these NaN values with modes of another column?

I am learning how to handle missing values in a dataset. I have a table with ~1 million entries, and I'm trying to deal with a small number of missing values.

My data concerns a bicycle-share system and my missing values are start & end locations.

Data: missing starting stations, only 7 values

Data: missing ending station, 24 values altogether

I want to fill the NaN in both cases with the mode of the "opposite" station. For example, for start_station == 21, I want to find the most common end_station and use that to fill in the missing value, e.g. df.loc[df['start_station'] == 21].end_station.mode()

I tried to achieve this with a function:

def inpute_end_station(df):
    for index, row in df.iterrows():    
        if pd.isnull(df.loc[index, 'end_station']):

            start_st = df.loc[index, 'start_station']
            mode = df.loc[df['start_station'] == start_st].end_station.mode()
            df.loc[index, 'end_station'].fillna(mode, inplace=True)

The last line throws an AttributeError: 'numpy.float64' object has no attribute 'fillna'. If I instead just use df.loc[index, 'end_station'] = mode, I get ValueError: Incompatible indexer with Series.

Am I approaching this properly? I understand it's bad practice to modify something you're iterating over in pandas, so what's the correct way to change the start_station and end_station columns, replacing the NaNs with the corresponding mode of the complementary station?

Upvotes: 1

Views: 1330

Answers (1)

Jaroslav Bezděk

Reputation: 7625

In my opinion, when you want to process a column row by row in pandas like this, the best practice is to use the apply() function.

For this particular case, I would suggest the following approach, shown below on my sample data. I do not have much experience with the mode() method, so I used the value_counts() method in combination with first_valid_index() to determine the mode.
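Before the full example, here is a minimal sketch of the two ways to get the most common value, on a made-up toy Series (the values are hypothetical, not from the question's data). Note that mode() returns a Series rather than a scalar, because there can be ties; that is most likely why assigning its result directly to a single cell in the question raised "Incompatible indexer with Series".

```python
import pandas as pd

# Hypothetical toy column: station 3 occurs most often, one value is missing.
end_station = pd.Series([3, 1, 3, 2, 3, None])

# value_counts() drops NaN by default and sorts by count (descending),
# so the first index label is the most frequent value.
most_common_a = end_station.value_counts().first_valid_index()

# mode() returns a Series of all modes, so take the first element
# to get a single scalar.
most_common_b = end_station.mode().iloc[0]

print(most_common_a, most_common_b)
```

Both expressions should give the same station here; they can differ only in how ties are broken.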

# import pandas
import pandas as pd

# make a sample data
list_of_rows = [
  {'start_station': 1, 'end_station': 1},
  {'start_station': None, 'end_station': 1},
  {'start_station': 1, 'end_station': 2},
  {'start_station': 1, 'end_station': 3},
  {'start_station': 2, 'end_station': None},
  {'start_station': 2, 'end_station': 3},
  {'start_station': 2, 'end_station': 3},
]

# make a pandas data frame
df = pd.DataFrame(list_of_rows)

# define a function
def fill_NaNs_in_end_station(row):
    if pd.isnull(row['end_station']):
        start_station = row['start_station']
        return df[df['start_station']==start_station].end_station.value_counts().first_valid_index()
    return row['end_station']

# apply function to dataframe
df['end_station'] = df.apply(fill_NaNs_in_end_station, axis=1)
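With ~1 million rows, a row-wise apply() that re-filters the whole frame for every row can be slow. A possible alternative, sketched below on the same sample data, is to compute the mode once per station with groupby() and then fill only the missing cells via map() and fillna(). The helper first_mode and the lookup names end_by_start and start_by_end are my own illustrative names, and I am assuming that ties may be broken by taking the smallest station id (mode() returns its values sorted).

```python
import pandas as pd

# Same sample data as above.
df = pd.DataFrame({
    'start_station': [1, None, 1, 1, 2, 2, 2],
    'end_station':   [1, 1, 2, 3, None, 3, 3],
})

def first_mode(values):
    """Return the most frequent non-NaN value, or NaN if there is none."""
    modes = values.mode()
    return modes.iloc[0] if not modes.empty else float('nan')

# Lookup tables, built before any filling:
# start station -> most common end station, and the reverse.
end_by_start = df.groupby('start_station')['end_station'].agg(first_mode)
start_by_end = df.groupby('end_station')['start_station'].agg(first_mode)

# Map each row's opposite station through the lookup; fillna() touches
# only the cells that are currently NaN.
df['end_station'] = df['end_station'].fillna(df['start_station'].map(end_by_start))
df['start_station'] = df['start_station'].fillna(df['end_station'].map(start_by_end))

print(df)
```

Rows where both stations are missing, or where the opposite station has no observed counterpart, stay NaN with this approach and would need separate handling.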

Upvotes: 1
