Denys

Reputation: 4557

Selecting columns not present in the dataframe

So, I am creating a dataframe from an XML file. It has some information on a dealer, and then a dealer has multiple cars - each car is a sub-element of the cars element and is represented by a value element - each cars.value element has various car attributes. So I use the explode function to create one row per car per dealer, as follows:

from pyspark.sql.functions import explode

exploded_dealer = df.select('dealer_id', explode('cars.value').alias('a_car'))

Now I want to get various attributes of cars.value.

I do it like this:

car_details_df = exploded_dealer.select('dealer_id', 'a_car.attribute1', 'a_car.attribute2')

And that works fine. But sometimes the cars.value elements don't have all the attributes I specify in my query. For example, some cars.value elements might have only attribute1, and then I get the following error when running the above code:

pyspark.sql.utils.AnalysisException: u"cannot resolve 'attribute2' given input columns: [dealer_id,attribute1];"

How do I ask Spark to execute the same query anyway, but just return None for attribute2 if it is not present?

UPDATE: I read my data as follows:

initial_file_df = sqlContext.read.format('com.databricks.spark.xml').options(rowTag='dealer').load('<xml file location>')

exploded_dealer = initial_file_df.select('financial_data', explode('cars.value').alias('a_car'))

Upvotes: 3

Views: 2226

Answers (1)

zero323

Reputation: 330083

Since you already make specific assumptions about the schema, the best thing you can do is define it explicitly, with nullable optional fields, and use it when importing the data.

Let's say you expect documents similar to:

<rows>
    <row>
        <id>1</id>
        <objects>
            <object>
                <attribute1>...</attribute1>
                ...
                <attributeN>...</attributeN>
            </object>
        </objects>
    </row>
</rows>

where attribute1, attribute2, ..., attributeN may not all be present in a given batch, but you can define a finite set of choices and the corresponding types. For simplicity let's say there are only two options:

{("attribute1", StringType), ("attribute2", LongType)}

You can define schema as:

from pyspark.sql.types import StructType, StructField, StringType, LongType

schema = StructType([
  StructField("objects", StructType([
    StructField("object", StructType([
      StructField("attribute1", StringType(), True),
      StructField("attribute2", LongType(), True)
    ]), True)
  ]), True),
  StructField("id", LongType(), True)
])

and use it with the reader:

spark.read.schema(schema).option("rowTag", "row").format("xml").load(...)

It will be valid for any subset of the attributes ({∅, {attribute1}, {attribute2}, {attribute1, attribute2}}). At the same time it is more efficient than depending on schema inference.
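Applied back to the dealer data from the question, the same idea looks roughly like this. It is a minimal sketch, not a definitive implementation: the exact cars.value layout and the attribute types are assumptions based on the snippets above, so adjust them to your real documents.

from pyspark.sql.functions import explode
from pyspark.sql.types import (ArrayType, LongType, StringType,
                               StructField, StructType)

# Assumed layout: a dealer row holding a cars.value array of car structs.
dealer_schema = StructType([
    StructField("dealer_id", LongType(), True),
    StructField("cars", StructType([
        StructField("value", ArrayType(StructType([
            StructField("attribute1", StringType(), True),
            StructField("attribute2", LongType(), True)
        ])), True)
    ]), True)
])

initial_file_df = (sqlContext.read
    .format('com.databricks.spark.xml')
    .options(rowTag='dealer')
    .schema(dealer_schema)
    .load('<xml file location>'))

exploded_dealer = initial_file_df.select(
    'dealer_id', explode('cars.value').alias('a_car'))

# Both attribute columns now resolve even when attribute2 is absent
# from the data itself; rows without it simply come back as null.
car_details_df = exploded_dealer.select(
    'dealer_id', 'a_car.attribute1', 'a_car.attribute2')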

Upvotes: 1
