blue-sky

Reputation: 53806

PySpark schema not recognised

I'm attempting to load a CSV file using this schema:

from pyspark.sql.types import StructType, StructField, StringType, ArrayType

sch = StructType([
    StructField("id", StringType(), True),
    StructField("words", ArrayType(StringType()), True)
])

dataFile = 'mycsv.csv'

df = (sqlContext.read
    .option("mode", "DROPMALFORMED")
    .schema(sch)
    .option("delimiter", ",")
    .option("charset", "UTF-8")
    .load(dataFile, format='com.databricks.spark.csv', header='true', inferSchema='false'))

mycsv.csv contains:

id , words
a , test here

I expect df to contain [Row(id='a', words=['test', 'here'])]

but instead it's empty, as df.collect() returns []

Is my schema defined correctly?

Upvotes: 0

Views: 400

Answers (1)

Pushkr

Reputation: 3619

Well, clearly your words column isn't of type Array; it's only a plain string. And since you have DROPMALFORMED enabled, the records are being dropped because they don't match the Array schema. Try a schema like the one below and it should work fine:

from pyspark.sql.types import StructType, StructField, StringType

sch = StructType([
    StructField("id", StringType(), True),
    StructField("words", StringType(), True)
])

Edit: if you really want the 2nd column as an Array/List of words, do this:

from pyspark.sql.functions import split

# split the words string on spaces to produce an array column
df.select(df.id, split(df.words, " ").alias('words')).show()

This outputs:

+---+--------------+
| id|         words|
+---+--------------+
| a |[, test, here]|
+---+--------------+
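
Note the empty first element in the array: it comes from the extra space after the comma in mycsv.csv, which split treats as a field boundary. If you want to get rid of it, one way (a small sketch, not part of the original answer) is to trim the column before splitting:

from pyspark.sql.functions import split, trim

# trim strips the leading/trailing whitespace around " test here",
# so splitting on spaces no longer yields an empty first element
df.select(df.id, split(trim(df.words), " ").alias('words')).show()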

Upvotes: 1
