Jack Daniel

Reputation: 2611

PySpark 2.0 - IndexToString error

I am working on Spark 2.0 using PySpark for a classification problem. I am trying to map the predictions of a classification algorithm back to their original label names, but the conversion fails.

Code:

from pyspark.ml.feature import IndexToString

predictions = dtModel.transform(self._pred)
converter = IndexToString(inputCol="prediction", outputCol="role")
converted = converter.transform(predictions)

Error:

  File "/hba03/yarn/nm/usercache/sbeathanabhotla/appcache/application_1498495374459_2397452/container_1498495374459_2397452_01_000001/build.zip/src/com/ci/buyerroletagging/service/ModelBuilder.py", line 45, in decision_tree_classifier
    converted = converter.transform(predictions.select('prediction'))
  File "/vol1/cloudera/parcels/SPARK2-2.0.0.cloudera1-1.cdh5.7.0.p0.113931/lib/spark2/python/lib/pyspark.zip/pyspark/ml/base.py", line 105, in transform
    return self._transform(dataset)
  File "/vol1/cloudera/parcels/SPARK2-2.0.0.cloudera1-1.cdh5.7.0.p0.113931/lib/spark2/python/lib/pyspark.zip/pyspark/ml/wrapper.py", line 252, in _transform
    return DataFrame(self._java_obj.transform(dataset._jdf), dataset.sql_ctx)
  File "/vol1/cloudera/parcels/SPARK2-2.0.0.cloudera1-1.cdh5.7.0.p0.113931/lib/spark2/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py", line 1133, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "/vol1/cloudera/parcels/SPARK2-2.0.0.cloudera1-1.cdh5.7.0.p0.113931/lib/spark2/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
    return f(*a, **kw)
  File "/vol1/cloudera/parcels/SPARK2-2.0.0.cloudera1-1.cdh5.7.0.p0.113931/lib/spark2/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py", line 319, in get_return_value
    format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o545.transform.
: java.lang.ClassCastException: org.apache.spark.ml.attribute.UnresolvedAttribute$ cannot be cast to org.apache.spark.ml.attribute.NominalAttribute
    at org.apache.spark.ml.feature.IndexToString.transform(StringIndexer.scala:313)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:745)    

Predictions:

+--------------------+--------------------+--------------------+--------------------+----------+
|           user_guid|            features|       rawPrediction|         probability|prediction|
+--------------------+--------------------+--------------------+--------------------+----------+
|9c0393cd-67e1-425...|(239,[1,89,125,21...|[0.0,44.0,0.0,0.0...|[0.0,1.0,0.0,0.0,...|       1.0|
|fdbaccb8-5946-472...|(239,[0,78,124,18...|[96.0,0.0,0.0,0.0...|[1.0,0.0,0.0,0.0,...|       0.0|
|fdbaccb8-5946-472...|(239,[0,78,130,18...|[96.0,0.0,0.0,0.0...|[1.0,0.0,0.0,0.0,...|       0.0|
|883bca4e-9a74-4dd...|(239,[1,28,123,13...|[0.0,44.0,0.0,0.0...|[0.0,1.0,0.0,0.0,...|       1.0|
|883bca4e-9a74-4dd...|(239,[1,28,123,13...|[0.0,0.0,42.0,0.0...|[0.0,0.0,1.0,0.0,...|       2.0|
|883bca4e-9a74-4dd...|(239,[1,28,124,13...|[0.0,0.0,0.0,21.0...|[0.0,0.0,0.0,0.28...|       3.0|
|883bca4e-9a74-4dd...|(239,[1,28,123,13...|[0.0,0.0,0.0,21.0...|[0.0,0.0,0.0,0.28...|       3.0|
|883bca4e-9a74-4dd...|(239,[1,28,128,13...|[0.0,0.0,0.0,21.0...|[0.0,0.0,0.0,0.28...|       3.0|
|883bca4e-9a74-4dd...|(239,[1,28,123,13...|[0.0,0.0,0.0,21.0...|[0.0,0.0,0.0,0.28...|       3.0|
|883bca4e-9a74-4dd...|(239,[1,28,128,13...|[0.0,0.0,42.0,0.0...|[0.0,0.0,1.0,0.0,...|       2.0|
|883bca4e-9a74-4dd...|(239,[1,28,124,13...|[0.0,44.0,0.0,0.0...|[0.0,1.0,0.0,0.0,...|       1.0|
|883bca4e-9a74-4dd...|(239,[1,28,124,13...|[0.0,0.0,0.0,21.0...|[0.0,0.0,0.0,0.28...|       3.0|
|883bca4e-9a74-4dd...|(239,[1,28,128,13...|[0.0,44.0,0.0,0.0...|[0.0,1.0,0.0,0.0,...|       1.0|
|883bca4e-9a74-4dd...|(239,[1,28,124,13...|[0.0,0.0,42.0,0.0...|[0.0,0.0,1.0,0.0,...|       2.0|
|883bca4e-9a74-4dd...|(239,[1,28,128,13...|[0.0,0.0,0.0,21.0...|[0.0,0.0,0.0,0.28...|       3.0|
|58b6246a-7f2a-40b...|(239,[0,64,124,19...|[0.0,0.0,0.0,21.0...|[0.0,0.0,0.0,0.28...|       3.0|
|58b6246a-7f2a-40b...|(239,[0,64,124,19...|[96.0,0.0,0.0,0.0...|[1.0,0.0,0.0,0.0,...|       0.0|
|d05b08ab-eef0-496...|(239,[10,80,124,1...|[96.0,0.0,0.0,0.0...|[1.0,0.0,0.0,0.0,...|       0.0|
|d05b08ab-eef0-496...|(239,[10,80,124,1...|[0.0,0.0,0.0,21.0...|[0.0,0.0,0.0,0.28...|       3.0|
|b35a734a-98ba-4e3...|(239,[0,30,129,22...|[96.0,0.0,0.0,0.0...|[1.0,0.0,0.0,0.0,...|       0.0|
+--------------------+--------------------+--------------------+--------------------+----------+
only showing top 20 rows

Am I missing anything here?

Upvotes: 3

Views: 2570

Answers (2)

waterg

Reputation: 53

Try reading the labels from the metadata of the indexed label column in the DataFrame and passing them to IndexToString, as below:

from pyspark.ml.feature import IndexToString

# Make predictions.
predictionsRaw = model.transform(testData)

# Read the original labels from the metadata of the indexed label column
# ("labelIndex" is the column name used in this example).
meta = [
    f.metadata for f in predictionsRaw.schema.fields if f.name == "labelIndex"]
labels = meta[0]["ml_attr"]["vals"]

# Convert predictions back to labels.
converter = IndexToString(inputCol="prediction", outputCol="predictedLabel", labels=labels)
PredictedLabels = converter.transform(predictionsRaw)

# Select example rows to display.
PredictedLabels.select("label", "labelIndex", "prediction", "predictedLabel", "features").show(5)
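
A slightly more defensive variant of the same idea, as a sketch only: it reads the labels from the column metadata and raises a readable error when they are absent, rather than failing inside the JVM. The helper names labels_from_metadata and convert_predictions are made up for illustration, and the default column name "labelIndex" is just the one used above. PySpark is imported lazily so the metadata helper can be tried on a plain dict.

```python
def labels_from_metadata(metadata):
    """Return the label list stored in a column's ML attribute metadata.

    `metadata` is the plain dict that PySpark exposes as StructField.metadata.
    Raises ValueError when the column carries no label values.
    """
    ml_attr = metadata.get("ml_attr", {})
    vals = ml_attr.get("vals")
    if not vals:
        raise ValueError("column has no label metadata; pass labels explicitly")
    return list(vals)


def convert_predictions(predictions, label_col="labelIndex"):
    """Map the numeric `prediction` column back to the original label strings.

    Hypothetical helper: assumes `label_col` was produced by a StringIndexer.
    """
    from pyspark.ml.feature import IndexToString  # lazy import; requires PySpark

    fields = [f for f in predictions.schema.fields if f.name == label_col]
    if not fields:
        raise ValueError("no column named %r in the DataFrame" % label_col)
    labels = labels_from_metadata(fields[0].metadata)
    converter = IndexToString(
        inputCol="prediction", outputCol="predictedLabel", labels=labels)
    return converter.transform(predictions)


# The metadata helper works on a dict shaped like the one read in the answer above.
example = {"ml_attr": {"type": "nominal", "vals": ["buyer", "seller"]}}
print(labels_from_metadata(example))
```

If the indexed label column is not present in the predictions DataFrame at all, this approach cannot work and the labels have to come from somewhere else, for instance from the fitted indexer.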

Upvotes: 2

desertnaut

Reputation: 60369

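You need to provide the labels argument explicitly. When it is omitted, IndexToString tries to read the labels from the metadata of the input column, and the prediction column here carries none, which is what the ClassCastException on UnresolvedAttribute is reporting. The examples in the documentation work without the argument because they convert a column produced by StringIndexer, which does carry that metadata. Here is a toy example with my own predictions and 2 classes: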

predictions.show()
# +-----+-----------+-------------+-----------+----------+ 
# |label|   features|rawPrediction|probability|prediction|
# +-----+-----------+-------------+-----------+----------+ 
# |    0|[140.0,0.0]|    [2.0,0.0]|  [1.0,0.0]|       0.0|
# |    0|[150.0,0.0]|    [2.0,0.0]|  [1.0,0.0]|       0.0|
# |    1|[160.0,1.0]|    [0.0,2.0]|  [0.0,1.0]|       1.0|
# |    1|[170.0,1.0]|    [0.0,2.0]|  [0.0,1.0]|       1.0|
# +-----+-----------+-------------+-----------+----------+

converter = IndexToString(inputCol="prediction", outputCol="role", labels=['a', 'b'])
converted = converter.transform(predictions)
converted.show()
# +-----+-----------+-------------+-----------+----------+----+ 
# |label|   features|rawPrediction|probability|prediction|role|
# +-----+-----------+-------------+-----------+----------+----+ 
# |    0|[140.0,0.0]|    [2.0,0.0]|  [1.0,0.0]|       0.0|   a|
# |    0|[150.0,0.0]|    [2.0,0.0]|  [1.0,0.0]|       0.0|   a|
# |    1|[160.0,1.0]|    [0.0,2.0]|  [0.0,1.0]|       1.0|   b|
# |    1|[170.0,1.0]|    [0.0,2.0]|  [0.0,1.0]|       1.0|   b|
# +-----+-----------+-------------+-----------+----------+----+

If I omit the labels argument, I get the same error as you. So, if your class indices go from 0.0 to 3.0 as in your sample, you will need something like labels=['a', 'b', 'c', 'd']. In general, labels must be a list with one entry per class, ordered so that position i holds the name for index i.
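
To see why the list has to cover every class: the conversion amounts to a positional lookup of each predicted index in labels. Below is a plain-Python sketch of that lookup (an illustration, not Spark's actual implementation), followed by one way to avoid typing the labels by hand, assuming the label column was indexed with a StringIndexer whose fitted model is still available. The names index_to_label, build_converter and label_indexer_model are hypothetical.

```python
def index_to_label(index, labels):
    """Look up one predicted index in the label list (positional lookup)."""
    position = int(index)
    if not 0 <= position < len(labels):
        raise ValueError("index %s has no entry in labels %r" % (index, labels))
    return labels[position]


def build_converter(label_indexer_model, input_col="prediction", output_col="role"):
    """Build an IndexToString that reuses the labels of a fitted StringIndexerModel.

    Assumption: `label_indexer_model` is the model returned by StringIndexer.fit().
    """
    from pyspark.ml.feature import IndexToString  # lazy import; requires PySpark

    # StringIndexerModel.labels is ordered by index, the order IndexToString expects.
    return IndexToString(inputCol=input_col, outputCol=output_col,
                         labels=label_indexer_model.labels)


labels = ["a", "b", "c", "d"]
print([index_to_label(i, labels) for i in (1.0, 0.0, 3.0)])
```

With a two-element list and a prediction of 3.0, the lookup has nothing to return, which is why the list needs one entry per class.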

Upvotes: 4
