blue-sky

Reputation: 53806

PySpark schema not recognised

I'm attempting to load a CSV file using this schema:

from pyspark.sql.types import StructType, StructField, StringType, ArrayType

sch = StructType([
    StructField("id", StringType(), True),
    StructField("words", ArrayType(StringType()), True)
])

dataFile = 'mycsv.csv'

df = (sqlContext.read
    .option("mode", "DROPMALFORMED")
    .schema(sch)
    .option("delimiter", ",")
    .option("charset", "UTF-8")
    .load(dataFile, format='com.databricks.spark.csv', header='true', inferSchema='false'))

mycsv.csv contains:

id , words
a , test here

I expect df to contain [Row(id='a', words=['test', 'here'])]

but instead it's empty, as df.collect() returns []

Is my schema defined correctly?

Upvotes: 0

Views: 400

Answers (1)

Pushkr

Reputation: 3619

Well, clearly your words column isn't of type Array; it's only a plain string. And since you have DROPMALFORMED enabled, the records are being dropped because they don't match the Array schema. Try a schema like the one below and it should work fine:

from pyspark.sql.types import StructType, StructField, StringType

sch = StructType([
    StructField("id", StringType(), True),
    StructField("words", StringType(), True)
])

Edit: if you really want the 2nd column as an Array/List of words, do this:

from pyspark.sql.functions import split

# split the words string on spaces to produce an array column
df.select(df.id, split(df.words, " ").alias('words')).show()

This outputs:

+---+--------------+
| id|         words|
+---+--------------+
| a |[, test, here]|
+---+--------------+
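
Note the empty first element in the array: it comes from the extra space after the comma in mycsv.csv, which split treats as a field boundary. If you want to get rid of it, one way (a small sketch, not part of the original answer) is to trim the column before splitting:

from pyspark.sql.functions import split, trim

# trim strips the leading/trailing whitespace around " test here",
# so splitting on spaces no longer yields an empty first element
df.select(df.id, split(trim(df.words), " ").alias('words')).show()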

Upvotes: 1
