jeevs

Reputation: 261

OneHotEncoder ValueError: Found unknown categories

I am building the OneHotEncoder using the full file.

import logging
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

CONSTANT_FILLER = 'missing'  # filler value used for NaN cells

def buildOneHotEncoder(training_file_name, categoricalCols):
    one_hot_encoder = OneHotEncoder(sparse=False)

    df = pd.read_csv(training_file_name, skiprows=0, header=0)
    df = df[categoricalCols]
    df = removeNaN(df, categoricalCols)
    logging.info(str(df.columns))
    one_hot_encoder.fit(df)
    return one_hot_encoder

def removeNaN(df, categoricalCols):
    # Replace any NaN values with the constant filler
    for col in categoricalCols:
        df[[col]] = df[[col]].fillna(value=CONSTANT_FILLER)
    return df

Now I am using this same encoder when processing the same file in chunks:

for chunk in pd.read_csv(training_file_name, chunksize=CHUNKSIZE):
    ...
    INPUT = chunk[categoricalCols]
    INPUT = removeNaN(INPUT, categoricalCols)
    one_hot_encoded = one_hot_encoder.transform(INPUT)
    ...

This gives me the error 'ValueError: Found unknown categories ['missing'] in column 2 during transform'.

I can't process the full file at once, because during training iterations the memory is needed to use all cores.
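For reference, the failure can be reproduced in isolation. This is a minimal sketch, not my pipeline, and it assumes scikit-learn's default handle_unknown='error':

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Fit on one set of categories, then transform data containing a
# category the encoder never saw during fit().
enc = OneHotEncoder()
enc.fit(np.array([['a'], ['b']]))

try:
    enc.transform(np.array([['missing']]))
except ValueError as err:
    print(err)  # Found unknown categories ['missing'] in column 0 ...
```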

Upvotes: 2

Views: 9138

Answers (3)

K3---rnc

Reputation: 7049

One workaround is to initialize OneHotEncoder with the handle_unknown= parameter:

one_hot_encoder = OneHotEncoder(sparse=False, handle_unknown='ignore')
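A minimal sketch of the effect (toy data, not the question's file; .toarray() is used instead of sparse=False so it runs on any scikit-learn version):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# With handle_unknown='ignore', a category not seen during fit()
# encodes as an all-zeros row instead of raising ValueError.
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(np.array([['a'], ['b']]))

out = enc.transform(np.array([['a'], ['missing']])).toarray()
print(out)
# [[1. 0.]
#  [0. 0.]]
```

Note the trade-off: the unknown category is silently dropped, so downstream code never learns it was there.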

Upvotes: 2

Basma Elshoky

Reputation: 149

Clean the data of any NaN values. The following code shows the count of NaN values for each column:

total_missing_data = data.isnull().sum().sort_values(ascending=False)
percent_of_missing_data = (data.isnull().sum()/data.isnull().count()*100).sort_values(ascending=False)
missing_data = pd.concat(
    [
        total_missing_data, 
        percent_of_missing_data
    ], 
    axis=1, 
    keys=['Total', 'Percent']
)
print(missing_data.head(10))

output such as:

       Total   Percent
age        2  0.284091

To get its locations: df.loc[data['age'].isnull()]

Then fill the NaN cells using the mean or median:

df.age[62] = data.age.median()

or drop all rows containing NaN:

df.dropna(inplace=True)

Upvotes: -1

jeevs

Reputation: 261

The issue was with applying

df_merged_set_test = chunk.where(chunk['weblab']=="missing")

I was filtering the dataset on a field, so every non-matching row was filled with NaN, and I was only later replacing those NaNs with the missing flag.

The correct way:

  1. Clean the dataset first, i.e. fill all NaN values in all columns.
  2. Then filter and drop the all-NaN rows, i.e. .where(chunk['weblab']=="missing").dropna()
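A minimal sketch of that order on toy data (the 'metric' column and its values are hypothetical; only 'weblab' and the 'missing' filler come from the question):

```python
import numpy as np
import pandas as pd

CONSTANT_FILLER = 'missing'  # NaN filler, as in the question

chunk = pd.DataFrame({
    'weblab': ['missing', np.nan, 'A'],
    'metric': ['x', np.nan, 'y'],
})

# 1. Fill NaN in every column first, so transform() only ever
#    sees categories the encoder was fitted on.
chunk = chunk.fillna(value=CONSTANT_FILLER)

# 2. Then filter: .where() blanks non-matching rows to all-NaN,
#    and dropna() removes exactly those rows.
filtered = chunk.where(chunk['weblab'] == 'missing').dropna()
print(filtered)
```

Filtering before filling does the opposite: the NaNs introduced by .where() get replaced by the filler, so rows you meant to drop survive with a category the encoder may not know.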

Upvotes: 1
