Reputation: 972
I want to load some sample data, and because it contains a field that is an array, I can't simply save it as CSV and load the CSV file.
from pyspark.sql.types import *
sample_data = [
    ["prasad, jones", 120, "M", [170, 50], "brown", "1999-10-15T19:50:23+00:00", 34, 0.1],
    ["maurice, khan", 82, "M", [130, 30], "blond", "1988-02-01T19:50:23+00:00", 67, 0.32],
]
customSchema = StructType([
    StructField("name", StringType(), True),
    StructField("income", IntegerType(), True),
    StructField("gender", StringType(), True),
    StructField("height_weight", ArrayType(
        StructType([
            StructField("height", IntegerType(), True),
            StructField("weight", IntegerType(), True)
        ]))),
    StructField("hair-color", StringType(), True),
    StructField("dateofbirth", TimestampType(), True),
    StructField("factorX", DoubleType(), True),
    StructField("factorY", DoubleType(), True)
])
# Try #1
df1 = spark.createDataFrame(sample_data, schema=customSchema)
# Try #2
df2 = spark.createDataFrame(spark.sparkContext.parallelize(sample_data), schema=customSchema)
I tried creating the dataframe directly, and also parallelizing the data before loading it, as suggested by other similar questions/answers, but I keep getting the following error:
TypeError: element in array field height_weight: StructType can not accept object 130 in type <class 'int'>
What am I missing? Or, what would be a simpler way to load this data? I tried a tab-separated text file, but spark.read.format('txt') did not work, and I did not find any information about how to do it.
Upvotes: 0
Views: 2773
Reputation: 972
The error occurs because my ArrayType was defined incorrectly. The field is a flat array of integers, [int, int], but my schema declared an array of structs, so Spark tried to treat each integer (such as 130) as a struct.
customSchema = StructType([
    StructField("name", StringType(), True),
    StructField("income", IntegerType(), True),
    StructField("gender", StringType(), True),
    StructField("height_weight", ArrayType(IntegerType()), True),
    StructField("hair-color", StringType(), True),
    StructField("dateofbirth", StringType(), True),
    StructField("factorX", IntegerType(), True),
    StructField("factorY", DoubleType(), True)
])
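This is the correct schema. Note that it also declares dateofbirth as StringType and factorX as IntegerType, which matches the sample values (an ISO 8601 string and an integer).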
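If the array-of-structs schema from the question is what you actually want, another option might be to reshape the data rather than the schema. The following is only a sketch in plain Python: reshape_row is a hypothetical helper name, and it assumes each height_weight list always holds exactly one height and one weight. It wraps the pair in a tuple so it can be read as a struct, parses the date string into a datetime, and casts factorX to float.

```python
from datetime import datetime

sample_data = [
    ["prasad, jones", 120, "M", [170, 50], "brown", "1999-10-15T19:50:23+00:00", 34, 0.1],
    ["maurice, khan", 82, "M", [130, 30], "blond", "1988-02-01T19:50:23+00:00", 67, 0.32],
]


def reshape_row(row):
    """Hypothetical helper: reshape one row to fit the question's original schema."""
    name, income, gender, height_weight, hair, dob, factor_x, factor_y = row
    height, weight = height_weight  # assumes exactly two values per list
    return [
        name,
        income,
        gender,
        [(height, weight)],           # array holding one (height, weight) struct
        hair,
        datetime.fromisoformat(dob),  # a datetime object for the TimestampType field
        float(factor_x),              # a float for the DoubleType field
        factor_y,
    ]


reshaped = [reshape_row(row) for row in sample_data]
```

The reshaped rows could then be passed to spark.createDataFrame(reshaped, schema=customSchema) with the original schema from the question. I have not run that last step against every Spark version, so treat it as an assumption to verify.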
Upvotes: 1