Reputation: 261
I am building the OneHotEncoder using the full file.
def buildOneHotEncoder(training_file_name, categoricalCols):
    one_hot_encoder = OneHotEncoder(sparse=False)
    df = pd.read_csv(training_file_name, skiprows=0, header=0)
    df = df[categoricalCols]
    df = removeNaN(df, categoricalCols)
    logging.info(str(df.columns))
    one_hot_encoder.fit(df)
    return one_hot_encoder

def removeNaN(df, categoricalCols):
    # Replace any NaN values
    for col in categoricalCols:
        df[[col]] = df[[col]].fillna(value=CONSTANT_FILLER)
    return df
Now I am using this same encoder while processing the same file in chunks:
for chunk in pd.read_csv(training_file_name, chunksize=CHUNKSIZE):
    ....
    INPUT = chunk[categoricalCols]
    INPUT = removeNaN(INPUT, categoricalCols)
    one_hot_encoded = one_hot_encoder.transform(INPUT)
    ....
It gives me this error: ValueError: Found unknown categories ['missing'] in column 2 during transform
I can't process the full file at once, because that memory is needed during the training iterations in order to use all cores.
Upvotes: 2
Views: 9138
Reputation: 7049
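One workaround is to initialize OneHotEncoder with the handle_unknown='ignore' parameter, so that categories not seen during fit are encoded as all zeros instead of raising an error: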
one_hot_encoder = OneHotEncoder(sparse=False, handle_unknown='ignore')
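To see what handle_unknown='ignore' does in practice, here is a minimal sketch on made-up data (the arrays and the category names "a" and "b" are assumptions for illustration, not from the question). It leaves the sparse argument at its default and converts the result with toarray(), since the name of that argument differs between scikit-learn versions.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one categorical column with two known values.
train = np.array([["a"], ["b"], ["a"]], dtype=object)

# "missing" never appeared during fit, so it is an unknown category.
test = np.array([["a"], ["missing"]], dtype=object)

encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# The default output is a sparse matrix; convert it to a dense array.
encoded = encoder.transform(test).toarray()
print(encoded)
# [[1. 0.]
#  [0. 0.]]
```

The unknown row comes out as all zeros, so the transform no longer fails, but the model also cannot tell that row apart from any other unseen value. Whether that is acceptable depends on the data.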
Upvotes: 2
Reputation: 149
Clean any NaN values out of the data first. The following code shows the count and percentage of NaN values for each column:
total_missing_data = data.isnull().sum().sort_values(ascending=False)
percent_of_missing_data = (data.isnull().sum()/data.isnull().count()*100).sort_values(ascending=False)
missing_data = pd.concat(
    [
        total_missing_data,
        percent_of_missing_data
    ],
    axis=1,
    keys=['Total', 'Percent']
)
print(missing_data.head(10))
This produces output such as:
     Total   Percent
age      2  0.284091
To find the rows where the value is missing:
data.loc[data['age'].isnull()]
Then fill the NaN values in the column using the mean or median, for example for row 62:
data.loc[62, 'age'] = data['age'].median()
or drop all rows that contain NaN:
data.dropna(inplace=True)
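To fill every missing value in a column in one step rather than one row at a time, fillna can take the median directly. This is a minimal sketch on made-up data, assuming a numeric column named age as in the output above.

```python
import numpy as np
import pandas as pd

# Hypothetical data: a numeric column with two missing values.
data = pd.DataFrame({"age": [20.0, np.nan, 40.0, np.nan, 30.0]})

# median() skips NaN by default, so this is the median of 20, 40 and 30.
median_age = data["age"].median()

# Assign the result back instead of using chained indexing.
data["age"] = data["age"].fillna(median_age)
print(data["age"].tolist())
# [20.0, 30.0, 40.0, 30.0, 30.0]
```

Note that a median only makes sense for numeric columns. For categorical columns a constant filler or the most frequent value is the usual substitute.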
Upvotes: -1
Reputation: 261
The issue was with this line:
df_merged_set_test = chunk.where(chunk['weblab']=="missing")
I was filtering the dataset on a field, but where() does not drop the rows that fail the condition; it keeps them and fills them with NaN. Later on I was replacing those NaN values with a missing flag.
The correct way is to drop those rows:
df_merged_set_test = chunk.where(chunk['weblab']=="missing").dropna()
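For anyone hitting the same thing, here is a small sketch of the difference on made-up data (the value column and the rows are assumptions for illustration). It also shows plain boolean indexing, which is an alternative to where().dropna() and not what the original code used.

```python
import pandas as pd

# Hypothetical chunk: the middle row does not match the condition.
chunk = pd.DataFrame({
    "weblab": ["missing", "x", "missing"],
    "value": [1, 2, 3],
})

# where() keeps the shape; the non-matching row becomes all NaN.
masked = chunk.where(chunk["weblab"] == "missing")

# dropna() then removes every row that contains a NaN.
dropped = masked.dropna()

# Boolean indexing selects the matching rows directly.
filtered = chunk[chunk["weblab"] == "missing"]

print(len(masked), list(dropped.index), list(filtered.index))
# 3 [0, 2] [0, 2]
```

One caveat: dropna() also removes matching rows that have a NaN in any other column, while boolean indexing keeps them, so the two are only equivalent when the matching rows are otherwise complete.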
Upvotes: 1