Reputation: 53806
I'm attempting to convert a CSV file into a DataFrame using this schema:
sch = StructType([
    StructField("id", StringType(), True),
    StructField("words", ArrayType(StringType()), True)
])
dataFile = 'mycsv.csv'
df = (sqlContext.read
      .format('com.databricks.spark.csv')
      .schema(sch)
      .option("mode", "DROPMALFORMED")
      .option("delimiter", ",")
      .option("charset", "UTF-8")
      .option("header", "true")
      .option("inferSchema", "false")
      .load(dataFile))
mycsv.csv contains:
id , words
a , test here
I expect df to contain [Row(id='a', words=['test', 'here'])], but instead df.collect() returns an empty list: [].
Is my schema defined correctly?
Upvotes: 0
Views: 400
Reputation: 3619
Well, clearly your words column isn't an Array in the CSV: the spark-csv source can only read flat string fields, so words comes in as StringType() only. And since you have DROPMALFORMED enabled, it's dropping every record because none of them match the Array schema. Try a schema like the one below and it should work fine:
sch = StructType([
    StructField("id", StringType(), True),
    StructField("words", StringType(), True)
])
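For completeness, a minimal sketch of the read call using that schema, reusing the reader options from your question (this assumes the spark-csv package is on the classpath, as in your original call):

# sch is the all-StringType schema defined above
df = (sqlContext.read
      .format('com.databricks.spark.csv')
      .schema(sch)
      .option("mode", "DROPMALFORMED")
      .option("delimiter", ",")
      .option("charset", "UTF-8")
      .option("header", "true")
      .load('mycsv.csv'))
df.collect()  # should now return one Row instead of an empty list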
Edit: if you really want the 2nd column as an Array/List of words, split it after reading:
from pyspark.sql.functions import split
df.select(df.id, split(df.words, " ").alias('words')).show()
This outputs:
+---+--------------+
| id| words|
+---+--------------+
| a |[, test, here]|
+---+--------------+
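Note the empty first element in the array: it comes from the space right after the comma in "a , test here". If you want to drop it, a small sketch using trim (also in pyspark.sql.functions) and splitting on runs of spaces:

from pyspark.sql.functions import split, trim

# trim leading/trailing whitespace first, then split on one or more spaces
df.select(df.id, split(trim(df.words), " +").alias('words')).show()
# should show [test, here] without the leading empty element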
Upvotes: 1