James Wine

Reputation: 75

PySpark Conversion to Array Types

I'm currently dealing with the following error while trying to run pyspark.sql.functions.explode on an array column in a DataFrame in PySpark. I tried creating a UDF to convert the column to a Python list when it is not a list instance, but that still throws the same error. In Pandas I'd typically pull out the row and decide what to do from there; here I'm not sure how to access the row to inspect the data and understand which conditions I need to account for.

I'm more looking for debugging advice in general, but if you know the answer that's great too!

Py4JJavaError: An error occurred while calling o2850.withColumn. : org.apache.spark.sql.AnalysisException: cannot resolve 'explode(lot)' due to data type mismatch: input to function explode should be array or map type, not LongType;;

df.printSchema()

root
 |-- lists: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- data: string (nullable = true)

df = df.withColumn("list",df.lists)
df = df.withColumn('list',sf.explode(df.list))

Original Code

from pyspark.sql import functions as sf 

# create duplicate column to use with explode 
# explode the array datetype into multiple rows per element 

df = spark.read.parquet("s3a://path/parquet/*")
df = df.withColumn("list",df.lists)
df = df.withColumn('list',sf.explode(df.list)) 

Upvotes: 1

Views: 7160

Answers (1)

Sahil Desai

Reputation: 3696

There is no need to use withColumn; you can explode the array directly.

from pyspark.sql.functions import explode

df = spark.read.parquet("s3a://path/parquet/*")
df = df.select(df['data'], explode(df['lists']).alias('list'))

Upvotes: 1
