Reputation: 11
We are working on migrating from Databricks runtime 9.1 LTS to 10.4 LTS, but we're running into strange behavioral issues. Our existing code works up to runtime 10.3; in 10.4 it stops working.
Problem: We have a nested JSON file that we are flattening into a Spark DataFrame using the code below:
import pyspark.sql.functions as F

adaccountsdf = (df
    .withColumn('Exp_Organizations',
                F.explode(F.col('organizations.organization')))
    .withColumn('Exp_AdAccounts',
                F.explode(F.col('Exp_Organizations.ad_accounts')))
    .select(F.col('Exp_Organizations.id').alias('organizationId'),
            F.col('Exp_Organizations.name').alias('organizationName'),
            F.col('Exp_AdAccounts.id').alias('adAccountId'),
            F.col('Exp_AdAccounts.name').alias('adAccountName'),
            F.col('Exp_AdAccounts.timezone').alias('timezone')))
Now, querying the dataframe works when we do the following select (results hidden due to confidentiality):
display(adaccountsdf.select("*"))
When we display the schema of the dataframe, we get the following:
root
|-- organizationId: string (nullable = true)
|-- organizationName: string (nullable = true)
|-- adAccountId: string (nullable = true)
|-- adAccountName: string (nullable = true)
|-- timezone: string (nullable = true)
so everything looks as it should. But the moment we start selecting from the last three fields (adAccountId, adAccountName and timezone):
display(adaccountsdf.select("adAccountId","adAccountName"))
We get the error AnalysisException: No such struct field id in 0, 1.
However, when I run the statement display(adaccountsdf.select("adAccountId")), it works just fine.
Does anyone know why this is happening? It's a very strange error that only shows up in Databricks runtime 10.4; all previous runtimes, including 10.3, 10.2, 10.1 and 9.1 LTS, work fine. The issue seems to be triggered by calling the explode function on an already-exploded column in the DataFrame.
UPDATE:
For some reason, when I run adaccountsdf.cache() before my select statements, the issue disappears. I would still like to know what causes this issue in runtime 10.4 but not in the earlier ones.
Upvotes: 1
Views: 802
Reputation: 140
In my case, I had a similar issue with Databricks 10.4 LTS: I needed to add several cache() calls to break the execution plan. After opening a support ticket with Microsoft, a bugfix was applied to our image and the problem was resolved (the Catalyst optimizer was infinitely re-optimizing the execution plan for complex types, or for repeated operations on the same column).
Upvotes: 1