I am learning how to handle missing values in a dataset. I have a table with ~1 million entries, and I'm trying to deal with a small number of missing values.
My data concerns a bicycle-share system and my missing values are start & end locations.
Data: missing starting stations, only 7 values
Data: missing ending station, 24 values altogether
I want to fill the NaN in both cases with the mode of the "opposite" station. For example, for start_station==21, I want to find the most common end_station and use that to fill in my missing value.
E.g. df.loc[df['start_station'] == 21].end_station.mode()
I tried to achieve this with a function:
def inpute_end_station(df):
    for index, row in df.iterrows():
        if pd.isnull(df.loc[index, 'end_station']):
            start_st = df.loc[index, 'start_station']
            mode = df.loc[df['start_station'] == start_st].end_station.mode()
            df.loc[index, 'end_station'].fillna(mode, inplace=True)
The last line throws an AttributeError: 'numpy.float64' object has no attribute 'fillna'. If instead I just use df.loc[index, 'end_station'] = mode, I get ValueError: Incompatible indexer with Series.
Am I approaching this properly? I understand it's bad practice to modify something you're iterating over in pandas, so what's the correct way of changing the start_station and end_station columns and replacing the NaNs with the corresponding mode of the complementary station?
Upvotes: 1
Views: 1330
Reputation: 7625
In my opinion, when you want to process a DataFrame row by row like this, the best practice is to use the apply() function rather than an explicit loop.
For this particular case, I would suggest the following approach, shown below on some sample data. I do not have much experience with the mode() method, so I used value_counts() in combination with first_valid_index() to determine the mode value.
# import pandas
import pandas as pd

# make some sample data
list_of_rows = [
    {'start_station': 1, 'end_station': 1},
    {'start_station': None, 'end_station': 1},
    {'start_station': 1, 'end_station': 2},
    {'start_station': 1, 'end_station': 3},
    {'start_station': 2, 'end_station': None},
    {'start_station': 2, 'end_station': 3},
    {'start_station': 2, 'end_station': 3},
]

# make a pandas data frame
df = pd.DataFrame(list_of_rows)

# define a function that returns the row's end_station, or, if it is missing,
# the most common end_station among rows with the same start_station
def fill_NaNs_in_end_station(row):
    if pd.isnull(row['end_station']):
        start_station = row['start_station']
        # value_counts() sorts by frequency, so the first index is the mode
        return df[df['start_station']==start_station].end_station.value_counts().first_valid_index()
    return row['end_station']

# apply the function to each row of the data frame
df['end_station'] = df.apply(fill_NaNs_in_end_station, axis=1)
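As a possible alternative, the same fill can probably be done without a row-wise apply() by computing the mode once per group and mapping it back. This is only a sketch under some assumptions: fill_with_group_mode and first_mode are hypothetical helper names, not pandas functions, and ties are resolved by taking the first value that mode() returns. It also hints at where the errors in the question come from: df.loc[index, 'end_station'] is a scalar, so it has no fillna, and mode() returns a Series rather than a single value, which is most likely why assigning it directly fails.

```python
import pandas as pd

# Hypothetical sample data, mirroring the example above
df = pd.DataFrame([
    {'start_station': 1, 'end_station': 1},
    {'start_station': None, 'end_station': 1},
    {'start_station': 1, 'end_station': 2},
    {'start_station': 1, 'end_station': 3},
    {'start_station': 2, 'end_station': None},
    {'start_station': 2, 'end_station': 3},
    {'start_station': 2, 'end_station': 3},
])

def first_mode(s):
    # Series.mode() returns a Series (several values on ties, empty if
    # the group is all NaN), so reduce it to a single scalar here.
    m = s.mode()
    return m.iloc[0] if not m.empty else float('nan')

def fill_with_group_mode(frame, target, by):
    # Most common `target` value for each distinct `by` value
    # (rows where `by` is NaN are dropped by groupby).
    modes = frame.groupby(by)[target].agg(first_mode)
    # Look up each row's group mode and use it only where `target` is NaN.
    return frame[target].fillna(frame[by].map(modes))

# Compute both fills from the original data before assigning, so that
# filled-in values do not influence the other column's modes.
new_end = fill_with_group_mode(df, 'end_station', 'start_station')
new_start = fill_with_group_mode(df, 'start_station', 'end_station')
df['end_station'] = new_end
df['start_station'] = new_start
```

A row is left as NaN if both of its stations are missing or if its group has no non-missing values, so it may be worth checking df.isna().sum() afterwards.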