Béatrice M.

Reputation: 972

Pyspark - Create DataFrame from List of Lists with an array field

I want to load some sample data, and because it contains a field that is an array, I can't simply save it as CSV and load the CSV file.

from pyspark.sql.types import *

sample_data = [["prasad, jones",120,"M",[170,50],"brown","1999-10-15T19:50:23+00:00",34,0.1],
["maurice, khan",82,"M",[130,30],"blond","1988-02-01T19:50:23+00:00",67,0.32]]

customSchema = StructType([
  StructField("name", StringType(), True),
  StructField("income", IntegerType(), True),
  StructField("gender", StringType(), True),
  StructField("height_weight", ArrayType(
      StructType([
          StructField("height", IntegerType(), True),
          StructField("weight", IntegerType(), True)
      ]))),
  StructField("hair-color", StringType(), True),
  StructField("dateofbirth", TimestampType(), True),
  StructField("factorX", DoubleType(), True),
  StructField("factorY", DoubleType(), True)
])

# Try #1
df1 = spark.createDataFrame(sample_data,schema=customSchema)
# Try #2
df2 = spark.createDataFrame(spark.sparkContext.parallelize(sample_data),schema=customSchema) 

I tried creating the DataFrame directly, and also parallelizing the data before loading it, as suggested in similar questions and answers, but I keep getting the following error:

TypeError: element in array field height_weight: StructType can not accept object 130 in type <class 'int'>

What am I missing? Or, what would be a simpler way to load this data? I tried a tab-separated text file, but spark.read.format('txt') did not work, and I could not find any information on how to do it.

Upvotes: 0

Views: 2773

Answers (1)

Béatrice M.

Reputation: 972

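It is because my ArrayType was defined incorrectly. Each height_weight value is a flat array of integers, [int, int], but ArrayType(StructType([...])) declares an array of structs, so Spark rejected the plain integer 130 where it expected a struct. The schema below also changes dateofbirth to StringType and factorX to IntegerType, because the sample data holds strings and plain integers in those columns, and createDataFrame checks the Python values against the schema rather than casting them.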

customSchema = StructType([
  StructField("name", StringType(), True),
  StructField("income", IntegerType(), True),
  StructField("gender", StringType(), True),
  StructField("height_weight", ArrayType(IntegerType()), True),
  StructField("hair-color", StringType(), True),
  StructField("dateofbirth", StringType(), True),
  StructField("factorX", IntegerType(), True),
  StructField("factorY", DoubleType(), True)
])

This is the correct schema.
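If you would rather keep TimestampType and DoubleType from the question's schema, one option is to convert the values in plain Python before calling createDataFrame. The following is only a sketch under that assumption: to_typed_row and build_dataframe are hypothetical helper names, not part of PySpark, and build_dataframe expects an existing SparkSession to be passed in.

```python
from datetime import datetime

sample_data = [
    ["prasad, jones", 120, "M", [170, 50], "brown", "1999-10-15T19:50:23+00:00", 34, 0.1],
    ["maurice, khan", 82, "M", [130, 30], "blond", "1988-02-01T19:50:23+00:00", 67, 0.32],
]


def to_typed_row(row):
    """Hypothetical helper: coerce one raw row to the Python types the schema expects."""
    name, income, gender, height_weight, hair, dob, factor_x, factor_y = row
    return [
        name,
        income,
        gender,
        height_weight,                # already a flat list of ints
        hair,
        datetime.fromisoformat(dob),  # str -> datetime, for TimestampType
        float(factor_x),              # int -> float, for DoubleType
        float(factor_y),
    ]


typed_data = [to_typed_row(r) for r in sample_data]


def build_dataframe(spark):
    """Hypothetical helper: build the DataFrame from typed_data using an existing session."""
    # Imported here so the conversion above can run without Spark installed.
    from pyspark.sql.types import (
        ArrayType, DoubleType, IntegerType, StringType,
        StructField, StructType, TimestampType,
    )

    schema = StructType([
        StructField("name", StringType(), True),
        StructField("income", IntegerType(), True),
        StructField("gender", StringType(), True),
        StructField("height_weight", ArrayType(IntegerType()), True),
        StructField("hair-color", StringType(), True),
        StructField("dateofbirth", TimestampType(), True),
        StructField("factorX", DoubleType(), True),
        StructField("factorY", DoubleType(), True),
    ])
    return spark.createDataFrame(typed_data, schema=schema)
```

With a live session, this would be used as df = build_dataframe(spark), followed by df.printSchema() to check the resulting column types.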

Upvotes: 1
