Reputation: 3770
I'm working with PySpark in a Python 3 environment. I have a dataframe and I'm trying to split a column of dense vectors into multiple columns of values. My df is this:
df_vector = kmeansModel_2.transform(finalData).select(['scalaredFeatures',
'prediction'])
df_vector.show()
+--------------------+----------+
| scalaredFeatures|prediction|
+--------------------+----------+
|[0.56785108466505...| 0|
|[1.41962771166263...| 0|
|[2.20042295307707...| 0|
|[0.14196277116626...| 0|
|[1.41962771166263...| 0|
+--------------------+----------+
Well, in order to do my task I'm using the following code:
def extract(row):
    return (row.prediction, ) + tuple(row.scalaredFeatures.toArray().tolist())

df = df_vector.rdd.map(extract).toDF(["prediction"])
Unfortunately I get an error:
Py4JJavaError: An error occurred while calling
z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 52.0 failed 1 times, most recent failure: Lost task
0.0 in stage 52.0 (TID 434, localhost, executor driver):
org.apache.spark.api.python.PythonException: Traceback (most recent
call last):
File "pyspark/worker.py", line 123, in main
("%d.%d" % sys.version_info[:2], version))
Exception: Python in worker has different version 2.7 than that in
driver 3.6, PySpark cannot run with different minor versions.Please
check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON
are correctly set.
Is there anybody who can help me with this task? Thanks!
Upvotes: 3
Views: 6821
Reputation: 1929
If you use PyCharm, you could add PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON to run/debug configurations.
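If changing the run/debug configuration isn't an option, the same two variables can also be set from the driver script itself, before the SparkSession is created. A minimal sketch, assuming the interpreter running the script (sys.executable) is the Python 3 you want the workers to use as well (the app name is arbitrary):

import os
import sys

# Both variables must be set before the SparkSession (and its JVM gateway)
# starts; pointing the workers at the driver's own interpreter guarantees
# the worker and driver Python versions match.
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kmeans-split").getOrCreate()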
Upvotes: 5