Reputation: 4888
I have a CSV file, and I'm preparing its data for training with different machine learning algorithms. I replaced missing numeric data with the mean of each column, but how should I deal with missing categorical data? Should I replace it with the most frequent element? And what is the easiest way to do it in Python using pandas?
Code:
import pandas as pd
dataset = pd.read_csv('doc.csv')
X = dataset.iloc[:, [2, 4, 5, 6, 7, 9, 10, 11]].values
y = dataset.iloc[:, -1].values
Column number 2 contains the categorical data.
First row value:
[3, 'S', 22.0, 1, 0, 7.25, 107722, 2]
Upvotes: 0
Views: 1148
Reputation: 2240
Regarding the modelling part of your question, you're better off asking that at CrossValidated.
If there are too many records with missing data, you could just remove that column from consideration altogether. There are some other excellent suggestions on this StackOverflow post, including scikit-learn's Imputer() class (SimpleImputer in newer releases), or just letting the model handle the missing data.
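To make the "remove that column" option concrete, here is a minimal sketch that drops columns whose share of missing values exceeds a cut-off. The toy data, the column names, and the 0.5 threshold are all made up for illustration; pick a threshold that suits your own data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; the column names are assumptions for illustration.
df = pd.DataFrame({
    "embarked": ["S", np.nan, "C", "S"],
    "cabin": [np.nan, np.nan, np.nan, "C85"],
    "age": [22.0, 38.0, np.nan, 35.0],
})

# Fraction of missing values in each column.
missing_ratio = df.isna().mean()

# Keep only the columns whose missing fraction is at or below the cut-off.
threshold = 0.5  # arbitrary value, tune for your data
reduced = df.loc[:, missing_ratio <= threshold]
```

Here "cabin" is missing in three of four rows, so it is the only column dropped.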
Regarding replacing values in a column, look into the DataFrame.replace() method:
DataFrame.replace(
to_replace=None,
value=None,
inplace=False,
limit=None,
regex=False,
method='pad',
axis=None)
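An example usage of this for your dataset, assuming that the missing values in the column are marked as 'N' and you are replacing them with some other category 'S' (which you found using the DataFrame.mode() method): dataset[1].replace('N', 'S'). Note that replace() returns a new object unless inplace=True, so assign the result back to the column, and adjust the column label to match your data.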
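Putting the pieces together, a minimal sketch of most-frequent-value imputation for one categorical column could look like this. The column name "embarked" and the 'N' placeholder are assumptions for illustration; if your missing values are already NaN, the replace() step is unnecessary.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column: one real NaN and one 'N' placeholder.
df = pd.DataFrame({"embarked": ["S", "C", np.nan, "S", "N"]})

# Treat the assumed 'N' placeholder as a proper missing value.
col = df["embarked"].replace("N", np.nan)

# mode() ignores NaN by default and returns a Series (there can be ties),
# so take the first entry as the most frequent category.
most_frequent = col.mode()[0]

# Fill the gaps and assign the result back to the DataFrame.
df["embarked"] = col.fillna(most_frequent)
```

Whether the most frequent category is a sensible fill value depends on how much data is missing and why, so treat this as one option, not a rule.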
Upvotes: 3