Gumpa

Reputation: 86

Dataframe loaded with different column type, and how to convert an ArrayType column to another

I have a dataframe with a column I created with collect_set. Its type is:

t.StructField("list_of_stuff", t.ArrayType(t.StringType(), False), True)

I want to write a test that validates the dataframe by comparing it to another one, which I load from a JSON file using the same schema. Although every row in the file contains a valid array value in this field, the loaded dataframe ends up with the type below for that column (the other columns are unchanged):

t.StructField("list_of_stuff", t.ArrayType(t.StringType(), True), True)

So, when I try to compare the two dataframes with assert_frame_equal, I get an error saying that the column is not the same.

So, two questions:

  1. Why does it load with t.ArrayType(t.StringType(), True) if I supplied a schema with t.ArrayType(t.StringType(), False)?
  2. How can I convert this column to t.ArrayType(t.StringType(), False)?

Upvotes: 2

Views: 45

Answers (1)

Gumpa

Reputation: 86

I managed to handle #2 with an identity UDF that declares the desired return type:

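from pyspark.sql.functions import udf
import pyspark.sql.types as t
converter = udf(lambda x: x, t.ArrayType(t.StringType(), False))  # identity UDF; only the declared return type matters
df = df.withColumn("list_of_stuff", converter("list_of_stuff"))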

Upvotes: 1
