
Reputation: 8951

Spark: Ambiguous reference to fields

I'm getting the following error attempting to flatten a highly nested structure:

org.apache.spark.sql.AnalysisException: Ambiguous reference to fields StructField(error,StructType(StructField(array,ArrayType(StructType(StructField(double,DoubleType,true), StructField(int,IntegerType,true), StructField(string,StringType,true)),true),true), StructField(double,DoubleType,true), StructField(int,IntegerType,true), StructField(string,StringType,true), StructField(struct,StructType(StructField(message,StringType,true), StructField(kind,StringType,true), StructField(stack,StringType,true)),true)),true), StructField(Error,StructType(StructField(array,ArrayType(StringType,true),true), StructField(string,StringType,true)),true)

I can't seem to figure out what in particular is causing this. What is the ambiguity, other than a deeply nested Struct?

Upvotes: 0

Views: 16937

Answers (1)

evinhas

Reputation: 199

This happens when you join two dataframes that both have a field with the same name. When you reference the duplicated field, Spark cannot tell which column you are requesting. Solution: rename the field on one side of the join. Example:

  • dfA is a dataframe with 2 columns => (id,name)
  • dfB is a dataframe with 3 columns => (id,name,description)

You join both dataframes on column "id" and want to select the "name" column from the second one:

val dfJoined = dfA.join(dfB, Seq("id"), "inner").select("name")

Since the column "name" exists in both dataframes, Spark cannot determine which "name" you are asking for.

Solution:

val dfRenamedB = dfB.withColumnRenamed("name", "b_name")

Now, after joining both dataframes, you get columns "name" and "b_name", and you can unambiguously select the one you need.
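Putting the steps above together, here is a minimal sketch. It assumes a local SparkSession, and the dataframe contents and names (dfA, dfB, "rename-example") are illustrative, not from the original question:

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch, assuming a local Spark session; the data is made up.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("rename-example")
  .getOrCreate()
import spark.implicits._

val dfA = Seq((1, "alice"), (2, "bob")).toDF("id", "name")
val dfB = Seq((1, "alice_b", "first"), (2, "bob_b", "second"))
  .toDF("id", "name", "description")

// Selecting the bare string "name" after this join would be ambiguous,
// so rename the column on one side first:
val dfRenamedB = dfB.withColumnRenamed("name", "b_name")

val dfJoined = dfA.join(dfRenamedB, Seq("id"), "inner")
dfJoined.select("name", "b_name").show()
```

An alternative, if you want to keep the original column names, is to qualify the reference through its parent dataframe, e.g. `dfA("name")` or `dfB("name")`, instead of passing the bare column name as a string.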

Upvotes: 2

Related Questions