Reputation: 31
I am trying to understand when to use Struct and when to use Map, since both can be used as nested documents like
{field1 :value1 , field2 : value2}
I think that a struct has a schema and a map doesn't. I also think that a Map key can be of any type except null, while a struct key (the field name) can only be a string? And what about performance?
But then I saw that I can add fields to a struct using the
withField(String fieldName, Column col)
method of Column.
So I tried to use
public static Column aggregate(Column expr,
Column initialValue,
scala.Function2<Column,Column,Column> merge)
and in each step add a field to a struct using withField.
I got this type mismatch. It is as if the struct may only have the field "a",
but I tried to add the field "ab",
violating the schema:
due to data type mismatch: argument 3 requires struct<a:string>, but got struct<a:string,ab:string>
I started with {"a" : "b"}
and I tried to add the pair "ab" "ewr"
inside the aggregate function.
*I tried the same with Map and it worked fine; the field was added.
Can I fix this and add the "ab" field to the struct? If not, why do we have withField
if we can't really add fields to structs? Or can only Map do it?
I prefer structs over Map, but I am not sure when to use each one.
Upvotes: 3
Views: 7909
Reputation: 5110
The difference between the Struct and Map types is that in a Struct all possible keys are defined in the schema and each value can have a different type (the key is the field name, which is a string). For a Map, we define one type for the keys and one type for the values, and can then add any (key, value) pair that respects those types.
When you use Map, your data file will be much bigger than with the same schema expressed as a Struct (Parquet, for example). As for performance, it depends on the data format and how that format stores the data on disk, but for almost all formats, processing a Map type is slower than processing a Struct.
To add a field to the nested column, you need to replace the old column with a new one that contains the new field:
df.withColumn("<col>", df.<col>.withField("<new field>", <col (lit or func)>))
Upvotes: 16