Pyspark Clarification required on Label Indexing in LogisticRegression

I am using pyspark 2.4.5.

I have a Spark DataFrame with x features. My target label has 3 classes: "High", "Medium", and "Low".

I index the label before building a logistic regression model.

So far, so good.

What is the problem? I index the label every time I build the model, and every time PySpark assigns the indexes differently. For example, if High gets 1 on the first run, it may get 0 on the next.

What help do I need? I need a solution so that the target label values are always assigned as High: 2, Medium: 1, and Low: 0.

A solution I thought of: without using label indexing, can I create a new column and map the target values as I need? Can it be done like this? When I predict scores, can I rely on the same mapping I used during training?

If label indexing is the only way, then how? Any reference links will be helpful. I always want the mapping to be High: 2, Medium: 1, and Low: 0.

Any solutions or reference links would be very helpful.

Upvotes: 1

Answers (2)

Raghu

Reputation: 1712

The StringIndexer assigns indexes based on label frequency (by default the most frequent label gets index 0). Maybe the random split gives you a different distribution of target labels in each run, so the ordering changes. The best way is to use IndexToString() when predicting.
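To see why the indexes move between runs, here is a small plain-Python imitation of frequency-descending ordering. It is only a sketch, not Spark's own code: the function name `frequency_desc_index`, the sample counts, and the alphabetical tie-break are all assumptions made for illustration.

```python
from collections import Counter

def frequency_desc_index(labels):
    """Imitate a frequency-descending label index: most frequent label -> 0.

    Ties are broken alphabetically here; that tie-break is an assumption
    of this sketch, not a statement about Spark 2.4.5.
    """
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda lab: (-counts[lab], lab))
    return {lab: idx for idx, lab in enumerate(ordered)}

# Two hypothetical training samples with different class frequencies.
run_1 = ["High"] * 5 + ["Medium"] * 3 + ["Low"] * 2
run_2 = ["Low"] * 6 + ["High"] * 3 + ["Medium"] * 1

index_1 = frequency_desc_index(run_1)  # High is most frequent in run_1
index_2 = frequency_desc_index(run_2)  # Low is most frequent in run_2
print(index_1)
print(index_2)
```

The same label ends up with a different integer as soon as the class counts in the sample change, which matches the behaviour described in the question.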

So, you save your StringIndexer model at training time and use it during prediction. Irrespective of the assigned integer, you will get back High, Low, or Medium as the prediction.

from pyspark.ml.feature import IndexToString
ind_str = IndexToString(inputCol='prediction', outputCol='pred_label', labels=pipeline_label.stages[0].labels)

In the above case I had stored my pipeline during training. So when predicting, I load the pipeline back and use stage 0 of the pipeline, which was the StringIndexer.

The same can also be done without the pipeline, using just the fitted StringIndexer model.
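Conceptually, the decoding step is a lookup into the saved `labels` list, where a label's position is the index it was given at training time. The plain-Python sketch below shows only that idea; `saved_labels`, `index_to_label`, and the prediction values are made up for illustration and do not come from a real fitted model.

```python
# Hypothetical contents of a fitted StringIndexer model's `labels` list:
# position in the list == index assigned during training.
saved_labels = ["Low", "High", "Medium"]

def index_to_label(prediction, labels):
    """Map a numeric prediction (e.g. 2.0) back to its original string label."""
    return labels[int(prediction)]

# Hypothetical values of the 'prediction' column.
predictions = [0.0, 2.0, 1.0, 0.0]
decoded = [index_to_label(p, saved_labels) for p in predictions]
print(decoded)
```

Because the lookup uses the list saved with the training run, it stays correct whichever integer each label happened to receive.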

Upvotes: 1

Som

Reputation: 6323

You can create a rule-based column:

from pyspark.sql.functions import expr
df.withColumn("label_index", expr("case when(label='High') then 2 when(label='Medium') then 1 else 0 end"))
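If the mapping is kept in one place, the same dictionary can produce the CASE expression for training and the reverse lookup for predictions. This is a minimal sketch under assumptions: `LABEL_TO_INDEX`, `case_expression`, and `INDEX_TO_LABEL` are hypothetical names, and unlike the expression above it has no ELSE branch, so an unexpected label would become NULL instead of 0.

```python
# Fixed mapping, defined once and reused for training and prediction.
LABEL_TO_INDEX = {"High": 2, "Medium": 1, "Low": 0}

def case_expression(column, mapping):
    """Build a SQL CASE expression string from a label -> index mapping."""
    branches = " ".join(
        "WHEN {col} = '{lab}' THEN {idx}".format(col=column, lab=lab, idx=idx)
        for lab, idx in mapping.items()
    )
    return "CASE {branches} END".format(branches=branches)

# String that could be passed to pyspark.sql.functions.expr
sql = case_expression("label", LABEL_TO_INDEX)

# Reverse lookup for turning predicted indexes back into labels.
INDEX_TO_LABEL = {idx: lab for lab, idx in LABEL_TO_INDEX.items()}
print(sql)
```

The resulting string can be used as `df.withColumn("label_index", expr(sql))`, and `INDEX_TO_LABEL[int(prediction)]` recovers the label at prediction time.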

Upvotes: 0
