Reputation: 2611
I am working on Spark 2.0 using Pyspark for a classification problem. I am trying to get the original names of the predictions of a Classification algorithm. But I am failing to do so.
Code:
predictions = dtModel.transform(self._pred)
converter = IndexToString(inputCol="prediction", outputCol="role")
converted = converter.transform(predictions)
Error:
File "/hba03/yarn/nm/usercache/sbeathanabhotla/appcache/application_1498495374459_2397452/container_1498495374459_2397452_01_000001/build.zip/src/com/ci/buyerroletagging/service/ModelBuilder.py", line 45, in decision_tree_classifier
converted = converter.transform(predictions.select('prediction'))
File "/vol1/cloudera/parcels/SPARK2-2.0.0.cloudera1-1.cdh5.7.0.p0.113931/lib/spark2/python/lib/pyspark.zip/pyspark/ml/base.py", line 105, in transform
return self._transform(dataset)
File "/vol1/cloudera/parcels/SPARK2-2.0.0.cloudera1-1.cdh5.7.0.p0.113931/lib/spark2/python/lib/pyspark.zip/pyspark/ml/wrapper.py", line 252, in _transform
return DataFrame(self._java_obj.transform(dataset._jdf), dataset.sql_ctx)
File "/vol1/cloudera/parcels/SPARK2-2.0.0.cloudera1-1.cdh5.7.0.p0.113931/lib/spark2/python/lib/py4j-0.10.3-src.zip/py4j/java_gateway.py", line 1133, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/vol1/cloudera/parcels/SPARK2-2.0.0.cloudera1-1.cdh5.7.0.p0.113931/lib/spark2/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/vol1/cloudera/parcels/SPARK2-2.0.0.cloudera1-1.cdh5.7.0.p0.113931/lib/spark2/python/lib/py4j-0.10.3-src.zip/py4j/protocol.py", line 319, in get_return_value
format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o545.transform.
: java.lang.ClassCastException: org.apache.spark.ml.attribute.UnresolvedAttribute$ cannot be cast to org.apache.spark.ml.attribute.NominalAttribute
at org.apache.spark.ml.feature.IndexToString.transform(StringIndexer.scala:313)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:237)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)
Predictions:
+--------------------+--------------------+--------------------+--------------------+----------+
| user_guid| features| rawPrediction| probability|prediction|
+--------------------+--------------------+--------------------+--------------------+----------+
|9c0393cd-67e1-425...|(239,[1,89,125,21...|[0.0,44.0,0.0,0.0...|[0.0,1.0,0.0,0.0,...| 1.0|
|fdbaccb8-5946-472...|(239,[0,78,124,18...|[96.0,0.0,0.0,0.0...|[1.0,0.0,0.0,0.0,...| 0.0|
|fdbaccb8-5946-472...|(239,[0,78,130,18...|[96.0,0.0,0.0,0.0...|[1.0,0.0,0.0,0.0,...| 0.0|
|883bca4e-9a74-4dd...|(239,[1,28,123,13...|[0.0,44.0,0.0,0.0...|[0.0,1.0,0.0,0.0,...| 1.0|
|883bca4e-9a74-4dd...|(239,[1,28,123,13...|[0.0,0.0,42.0,0.0...|[0.0,0.0,1.0,0.0,...| 2.0|
|883bca4e-9a74-4dd...|(239,[1,28,124,13...|[0.0,0.0,0.0,21.0...|[0.0,0.0,0.0,0.28...| 3.0|
|883bca4e-9a74-4dd...|(239,[1,28,123,13...|[0.0,0.0,0.0,21.0...|[0.0,0.0,0.0,0.28...| 3.0|
|883bca4e-9a74-4dd...|(239,[1,28,128,13...|[0.0,0.0,0.0,21.0...|[0.0,0.0,0.0,0.28...| 3.0|
|883bca4e-9a74-4dd...|(239,[1,28,123,13...|[0.0,0.0,0.0,21.0...|[0.0,0.0,0.0,0.28...| 3.0|
|883bca4e-9a74-4dd...|(239,[1,28,128,13...|[0.0,0.0,42.0,0.0...|[0.0,0.0,1.0,0.0,...| 2.0|
|883bca4e-9a74-4dd...|(239,[1,28,124,13...|[0.0,44.0,0.0,0.0...|[0.0,1.0,0.0,0.0,...| 1.0|
|883bca4e-9a74-4dd...|(239,[1,28,124,13...|[0.0,0.0,0.0,21.0...|[0.0,0.0,0.0,0.28...| 3.0|
|883bca4e-9a74-4dd...|(239,[1,28,128,13...|[0.0,44.0,0.0,0.0...|[0.0,1.0,0.0,0.0,...| 1.0|
|883bca4e-9a74-4dd...|(239,[1,28,124,13...|[0.0,0.0,42.0,0.0...|[0.0,0.0,1.0,0.0,...| 2.0|
|883bca4e-9a74-4dd...|(239,[1,28,128,13...|[0.0,0.0,0.0,21.0...|[0.0,0.0,0.0,0.28...| 3.0|
|58b6246a-7f2a-40b...|(239,[0,64,124,19...|[0.0,0.0,0.0,21.0...|[0.0,0.0,0.0,0.28...| 3.0|
|58b6246a-7f2a-40b...|(239,[0,64,124,19...|[96.0,0.0,0.0,0.0...|[1.0,0.0,0.0,0.0,...| 0.0|
|d05b08ab-eef0-496...|(239,[10,80,124,1...|[96.0,0.0,0.0,0.0...|[1.0,0.0,0.0,0.0,...| 0.0|
|d05b08ab-eef0-496...|(239,[10,80,124,1...|[0.0,0.0,0.0,21.0...|[0.0,0.0,0.0,0.28...| 3.0|
|b35a734a-98ba-4e3...|(239,[0,30,129,22...|[96.0,0.0,0.0,0.0...|[1.0,0.0,0.0,0.0,...| 0.0|
+--------------------+--------------------+--------------------+--------------------+----------+
only showing top 20 rows
Am I missing anything here?
Upvotes: 3
Views: 2570
Reputation: 53
Try getting metadata and labels from dataframe and apply the labels below
# Make predictions.
predictionsRaw = model.transform(testData)
# Convert predictions back to labels
meta = [
f.metadata for f in predictionsRaw.schema.fields if f.name == "labelIndex"]
labels = meta[0]["ml_attr"]["vals"]
from pyspark.ml.feature import IndexToString
converter = IndexToString(inputCol="prediction", outputCol="predictedLabel", labels=labels)
PredictedLabels = converter.transform(predictionsRaw)
# Select example rows to display.
PredictedLabels.select("label","labelIndex","prediction", "predictedLabel", "features").show(5)
Upvotes: 2
Reputation: 60369
Well, it is completely mysterious why, but you need to provide the labels
argument (although the examples in the documentation seem to work without it). Here is a toy example with my own predictions
and 2 classes:
predictions.show()
# +-----+-----------+-------------+-----------+----------+
# |label| features|rawPrediction|probability|prediction|
# +-----+-----------+-------------+-----------+----------+
# | 0|[140.0,0.0]| [2.0,0.0]| [1.0,0.0]| 0.0|
# | 0|[150.0,0.0]| [2.0,0.0]| [1.0,0.0]| 0.0|
# | 1|[160.0,1.0]| [0.0,2.0]| [0.0,1.0]| 1.0|
# | 1|[170.0,1.0]| [0.0,2.0]| [0.0,1.0]| 1.0|
# +-----+-----------+-------------+-----------+----------+
converter = IndexToString(inputCol="prediction", outputCol="role", labels=['a', 'b'])
converted = converter.transform(predictions)
converted.show()
# +-----+-----------+-------------+-----------+----------+----+
# |label| features|rawPrediction|probability|prediction|role|
# +-----+-----------+-------------+-----------+----------+----+
# | 0|[140.0,0.0]| [2.0,0.0]| [1.0,0.0]| 0.0| a|
# | 0|[150.0,0.0]| [2.0,0.0]| [1.0,0.0]| 0.0| a|
# | 1|[160.0,1.0]| [0.0,2.0]| [0.0,1.0]| 1.0| b|
# | 1|[170.0,1.0]| [0.0,2.0]| [0.0,1.0]| 1.0| b|
# +-----+-----------+-------------+-----------+----------+----+
If I omit the labels
argument, I get the same error as you. So, if your own labels go from 0.0 to 3.0 as in your sample, you will need something like labels=['a', 'b', 'c', 'd']
- in general, labels
must be a list of the same length as the number of your labels.
Upvotes: 4