Reputation: 165
I want to create a PySpark DataFrame in which there is a column with a variable schema. So my DataFrame can look something like this:
| Id | Variable_Column |
|----|----------------------------------|
| 1 | [{"col1":"val1"}] |
| 2 | [{"col1":"val2", "col2":"val3"}] |
To achieve this, I started out like this:
from pyspark.sql.types import StructType, StructField, IntegerType, ArrayType

schema = StructType([
    StructField("Id", IntegerType(), True),
    StructField("Variable_Column", ArrayType(StructType()), True)
])

valdict = dict()
valdict["col1"] = "val1"
values = [(1, [valdict])]
df = spark.createDataFrame(values, schema)
display(df)
| Id | Variable_Column |
|----|-----------------|
| 1 | [{}] |
Doing it this way, I'm creating an empty array. Also, this does not feel right; I want the types of the internal columns to be preserved as well. Please suggest the right way to insert data.
For my variable column I'm using "ArrayType(StructType())"; is that the right column type to use?
Upvotes: 2
Views: 4275
Reputation: 10406
SOLUTION 1
If you simply want to create a column with a variable number of values, you can use an ArrayType of StructType. In your case, you defined an empty StructType, hence the result you get.
You can define a dataframe like this:
df1 = spark.createDataFrame(
    [(1, [('name1', 'val1'), ('name2', 'val2')]),
     (2, [('name3', 'val3')])],
    ['Id', 'Variable_Column'])
df1.show(truncate=False)
which corresponds to the example you provide:
+---+----------------------------+
|Id |Variable_Column |
+---+----------------------------+
|1 |[[name1,val1], [name2,val2]]|
|2 |[[name3,val3]] |
+---+----------------------------+
Note that you don't need to explicitly define the schema in that case, but if you want to, it would look like this (you can call df1.schema to print it, by the way):
from pyspark.sql.types import (StructType, StructField, LongType,
                               ArrayType, StringType)

schema = StructType([
    StructField('Id', LongType()),
    StructField('Variable_Column', ArrayType(StructType([
        StructField('name', StringType()),
        StructField('value', StringType())
    ])))
])
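For instance, you can then pass this schema explicitly when creating the DataFrame (a minimal sketch, assuming the same SparkSession spark as above):
df1 = spark.createDataFrame(
    [(1, [('name1', 'val1'), ('name2', 'val2')]),
     (2, [('name3', 'val3')])],
    schema)
df1.printSchema()  # Id: long, Variable_Column: array<struct<name:string,value:string>>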
SOLUTION 2
Very similarly, you could use the MapType type like this:
df2 = spark.createDataFrame(
    [(1, dict([('name1', 'val1'), ('name2', 'val2')])),
     (2, dict([('name3', 'val3')]))],
    ['Id', 'Variable_Column'])
df2.show(truncate=False)
+---+---------------------------------+
|Id |Variable_Column |
+---+---------------------------------+
|1 |Map(name2 -> val2, name1 -> val1)|
|2 |Map(name3 -> val3) |
+---+---------------------------------+
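A side benefit of the map representation is that you can look up values by key directly, with rows missing the key returning null. A minimal sketch, assuming the df2 defined above:
from pyspark.sql import functions as F

# Look up the value for key 'name1' in each row's map.
df2.select('Id', F.col('Variable_Column').getItem('name1').alias('name1')).show()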
SOLUTION 3
In a comment, you say that you would also want variable types. That's not possible with DataFrames. If that's truly what you want, you may not be using the right tool. But if it is just a corner case, you could keep a record of the type of the data in a string, like this:
df3 = spark.createDataFrame(
    [(1, [('name1', 'val1', 'string'),
          ('name2', '0.6', 'double')]),
     (2, [('name3', '3', 'integer')])],
    ['Id', 'Variable_Column'])
df3.show(truncate=False)
+---+-----------------------------------------+
|Id |Variable_Column |
+---+-----------------------------------------+
|1 |[[name1,val1,string], [name2,0.6,double]]|
|2 |[[name3,3,integer]] |
+---+-----------------------------------------+
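If you later need properly typed values, one option (a sketch building on the df3 above; note that Spark names the inferred struct fields _1, _2, _3 for plain tuples) is to explode the array, filter on the type tag, and cast:
from pyspark.sql import functions as F

# One row per (name, value, type) entry.
exploded = df3.select('Id', F.explode('Variable_Column').alias('entry'))

# Keep only the entries tagged 'double' and cast them.
doubles = (exploded
           .filter(F.col('entry._3') == 'double')
           .select('Id',
                   F.col('entry._1').alias('name'),
                   F.col('entry._2').cast('double').alias('value')))
doubles.show()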
Upvotes: 1
Reputation: 219
You can define the schema as below:
from pyspark.sql.types import StructType, StructField, IntegerType, ArrayType, MapType, StringType

schema = StructType([
    StructField("Id", IntegerType(), True),
    StructField("Variable_Column", ArrayType(MapType(StringType(), StringType())), True)
])
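A minimal sketch of inserting a row with this schema (assuming a SparkSession named spark; the row mirrors the output shown below):
values = [(1, [{"col1": "val2", "col2": "val3"}])]
df = spark.createDataFrame(values, schema)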
This will give output like below:
df.show()
+---+--------------------+
| Id| Variable_Column|
+---+--------------------+
| 1|[[col2 -> val3, c...|
+---+--------------------+
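Note that calling df.show(truncate=False) would display the full map values instead of the truncated output above.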
Upvotes: 1