Reputation: 99
I have a requirement wherein I have to convert a list of dictionaries into a DataFrame; below are the dictionaries:
{'col1': None, 'col2': '70', 'col3': None}, {'col1': None, 'col2': '1', 'col3': '70'}
Below is the Spark code for the same:
spark_df = sc.parallelize([{'col1': None, 'col2': '70', 'col3': None},{'col1': None, 'col2': '1', 'col3': '70'}]).toDF()
However, it throws the following error:
ValueError: Some of types cannot be determined by the first 100 rows, please try again with sampling
This error occurs when a particular column's value is null for all records.
Can someone please help me with a PySpark implementation that handles this?
Upvotes: 0
Views: 3627
Reputation: 4301
Define a schema for your DataFrame and use nullable=True for the columns that can contain null values. Spark infers each column's type by sampling the data, and a column that is null in every row gives it nothing to infer from; supplying an explicit schema skips inference entirely.
from pyspark.sql.types import StructType, StructField, StringType

# Mark every field nullable so all-null columns don't break type inference
y = StructType([StructField("col1", StringType(), nullable=True),
                StructField("col2", StringType(), nullable=True),
                StructField("col3", StringType(), nullable=True),
                StructField("col4", StringType(), nullable=True)])
Now provide this schema to the toDF() method:
spark_df = sc.parallelize([{'col1': None, 'col2': '70', 'col3': None},
                           {'col1': None, 'col2': '1', 'col3': '70'}]).toDF(schema=y)
spark_df.show()
+----+----+----+----+
|col1|col2|col3|col4|
+----+----+----+----+
|null| 70|null|null|
|null| 1| 70|null|
+----+----+----+----+
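As a side note, the same schema can also be passed to createDataFrame directly, which avoids the RDD detour. A minimal sketch, assuming a SparkSession is already available as spark:

# Assumption: a SparkSession named `spark` exists in this session
data = [{'col1': None, 'col2': '70', 'col3': None},
        {'col1': None, 'col2': '1', 'col3': '70'}]

# With an explicit schema, the all-null columns never go through
# type inference, so no ValueError is raised
spark_df = spark.createDataFrame(data, schema=y)
spark_df.show()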
Upvotes: 4