Reputation: 75
I'm currently dealing with the following error while trying to run pyspark.sql.functions.explode on an array column in a DataFrame in PySpark. I've tried creating a UDF to convert the column to a Python list if it is not a list instance, but that still throws the same error. In Pandas I'd typically pull out the row and decide what to do from there; here I'm not sure how to access the row to look at the data and understand what conditions I need to account for.
I'm more looking for debugging advice in general, but if you know the answer that's great too!
Py4JJavaError: An error occurred while calling o2850.withColumn.
: org.apache.spark.sql.AnalysisException: cannot resolve 'explode(`list`)' due to data type mismatch: input to function explode should be array or map type, not LongType;;
df.printSchema()
root
|-- lists: array (nullable = true)
| |-- element: long (containsNull = true)
|-- data: string (nullable = true)
from pyspark.sql import functions as sf

df = spark.read.parquet("s3a://path/parquet/*")
# create duplicate column to use with explode
df = df.withColumn("list", df.lists)
# explode the array datatype into multiple rows per element
df = df.withColumn("list", sf.explode(df.list))
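As general debugging advice for this kind of error: before calling explode, you can inspect the column's actual type with `df.dtypes`, which returns (name, type-string) pairs. A minimal sketch of that check; the helper name `can_explode` is my own and not part of PySpark:

```python
def can_explode(dtype: str) -> bool:
    """Return True if a Spark SQL type string is accepted by explode."""
    # explode requires an array or map column; anything else
    # (e.g. "bigint", i.e. LongType) raises the AnalysisException above
    return dtype.startswith("array") or dtype.startswith("map")

# Usage against a real DataFrame (assumes the `df` from above):
#   dict(df.dtypes)["lists"]                 # e.g. "array<bigint>"
#   can_explode(dict(df.dtypes)["lists"])    # True for array columns
```

This also makes the error above easier to interpret: if you re-run the `withColumn('list', sf.explode(...))` cell, the second run sees `list` as an already-exploded long column, which is exactly the LongType the exception complains about.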
Upvotes: 1
Views: 7160
Reputation: 3696
There is no need to use withColumn; you can explode the array directly.
from pyspark.sql.functions import explode

df = spark.read.parquet("s3a://path/parquet/*")
df.select(df['data'], explode(df['lists']).alias('list'))
Upvotes: 1