Reputation: 4557
So, I am creating a dataframe from an XML file. It has some information on a dealer, and then a dealer has multiple cars - each car is a sub-element of the cars element and is represented by a value element - each cars.value element has various car attributes. So I use an explode function to create one row per car for a dealer, like follows:
exploded_dealer = df.select('dealer_id',explode('cars.value').alias('a_car'))
And now I want to get various attributes of cars.value. I do it like this:
car_details_df = exploded_dealer.select('dealer_id','a_car.attribute1','a_car.attribute2')
And that works fine. But sometimes the cars.value elements don't have all the attributes I specify in my query. For example, some cars.value elements might have only attribute1 - and then I get the following error when running the above code:
pyspark.sql.utils.AnalysisException: u"cannot resolve 'attribute2' given input columns: [dealer_id,attribute1];"
How do I ask Spark to execute the same query anyway, but just return None for attribute2 if it is not present?
UPDATE: I read my data as follows:
initial_file_df = sqlContext.read.format('com.databricks.spark.xml').options(rowTag='dealer').load('<xml file location>')
exploded_dealer = df.select('financial_data',explode('cars.value').alias('a_car'))
Upvotes: 3
Views: 2226
Reputation: 330083
Since you already make specific assumptions about the schema, the best thing you can do is to define it explicitly, with the optional fields marked nullable, and use it when importing the data.
Let's say you expect documents similar to:
<rows>
    <row>
        <id>1</id>
        <objects>
            <object>
                <attribute1>...</attribute1>
                ...
                <attributeN>...</attributeN>
            </object>
        </objects>
    </row>
</rows>
where attribute1, attribute2, ..., attributeN may not be present in a given batch, but you can define a finite set of choices and the corresponding types. For simplicity let's say there are only two options:
{("attribute1", StringType), ("attribute2", LongType)}
You can define schema as:
from pyspark.sql.types import StructType, StructField, StringType, LongType

schema = StructType([
    StructField("objects", StructType([
        StructField("object", StructType([
            StructField("attribute1", StringType(), True),
            StructField("attribute2", LongType(), True)
        ]), True)
    ]), True),
    StructField("id", LongType(), True)
])
and use it with the reader:
spark.read.schema(schema).option("rowTag", "row").format("xml").load(...)
It will be valid for any subset of attributes ({∅, {attribute1}, {attribute2}, {attribute1, attribute2}}). At the same time it is more efficient than depending on schema inference.
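For completeness, a minimal sketch of what the failing select then looks like under this schema (the file name objects.xml is hypothetical, and spark-xml is assumed to be on the classpath): because the attribute fields are declared nullable, missing attributes come back as null instead of raising an AnalysisException.

# A minimal sketch, assuming the sample <rows>/<row> layout above
# and a hypothetical file path "objects.xml".
df = (spark.read
      .schema(schema)
      .option("rowTag", "row")
      .format("xml")
      .load("objects.xml"))

# attribute1 and attribute2 are nullable in the schema, so selecting them
# never fails; rows whose <object> lacks <attribute2> simply show null there.
df.select("id", "objects.object.attribute1", "objects.object.attribute2").show()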
Upvotes: 1