Reputation: 51
I'm using Spark in Java to process XML files. The spark-xml package from Databricks is used to read the XML files into a DataFrame.
The example xml files are:
<RowTag>
<id>1</id>
<name>john</name>
<expenses>
<travel>
<details>
<date>20191203</date>
<amount>400</amount>
</details>
</travel>
</expenses>
</RowTag>
<RowTag>
<id>2</id>
<name>joe</name>
<expenses>
<food>
<details>
<date>20191204</date>
<amount>500</amount>
</details>
</food>
</expenses>
</RowTag>
The resulting Spark Dataset<Row> df is shown below; each row represents one XML file.
+--+------+----------------+
|id| name |expenses        |
+--+------+----------------+
|1 | john |[[20191203,400]]|
|2 | joe  |[[20191204,500]]|
+--+------+----------------+
df.printSchema();
shows:
root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- expenses: struct (nullable = true)
 |    |-- travel: struct (nullable = true)
 |    |    |-- details: struct (nullable = true)
 |    |    |    |-- date: string (nullable = true)
 |    |    |    |-- amount: integer (nullable = true)
 |    |-- food: struct (nullable = true)
 |    |    |-- details: struct (nullable = true)
 |    |    |    |-- date: string (nullable = true)
 |    |    |    |-- amount: integer (nullable = true)
The desired output dataframe is like:
+--+------+-------------+
|id| name |expenses_date|
+--+------+-------------+
|1 | john |20191203     |
|2 | joe  |20191204     |
+--+------+-------------+
Basically, I want a generic solution that gets the date from any XML file with the following structure, in which only the tag <X> differs.
<RowTag>
<id>1</id>
<name>john</name>
<expenses>
<X>
<details>
<date>20191203</date>
<amount>400</amount>
</details>
</X>
</expenses>
</RowTag>
What I have tried:
spark.udf().register("getDate", (UDF1<Row, String>) (Row row) -> {
    return row.getStruct(0).getStruct(0).getAs("date").toString();
}, DataTypes.StringType);

df.select(callUDF("getDate", df.col("expenses")).as("expenses_date")).show();
But it didn't work, because row.getStruct(0) routes to <travel>, and for row joe there is no <travel> tag under <expenses>, so it threw a java.lang.NullPointerException. What I want is a generic solution that, for each row, automatically picks whichever child tag is present, e.g. row.getStruct(0) routes to <travel> for row john and to <food> for row joe.
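In other words, the lookup should fall back across whichever child structs of expenses actually exist and descend into details.date on the first non-null one. A plain-Java sketch of that idea (no Spark here; nested Maps stand in for Spark Rows, and the firstDate helper name is mine):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FirstNonNullDate {
    // Scan the children of `expenses` in schema order and return
    // `details`.`date` from the first child struct that is present.
    @SuppressWarnings("unchecked")
    static String firstDate(Map<String, Object> expenses) {
        for (Object child : expenses.values()) {
            if (child == null) continue;  // this tag is absent for this row
            Map<String, Object> details =
                    (Map<String, Object>) ((Map<String, Object>) child).get("details");
            Object date = details.get("date");
            if (date != null) return date.toString();
        }
        return null;  // no matching tag at all
    }

    public static void main(String[] args) {
        // Row "joe": no <travel>, only <food>
        Map<String, Object> details = new LinkedHashMap<>();
        details.put("date", "20191204");
        details.put("amount", 500);
        Map<String, Object> food = new LinkedHashMap<>();
        food.put("details", details);
        Map<String, Object> joe = new LinkedHashMap<>();
        joe.put("travel", null);
        joe.put("food", food);
        System.out.println(firstDate(joe));  // prints 20191204
    }
}
```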
So my question is: how should I reformulate my UDF to achieve this?
Thanks in advance!! :)
Upvotes: 0
Views: 644
Reputation: 32680
The spark-xml package allows you to access nested fields directly in the select expression. Why are you looking for a UDF?
df.selectExpr("id", "name", "COALESCE(`expenses`.`food`.`details`.`date`, `expenses`.`travel`.`details`.`date`) AS expenses_date").show()
Output:
+---+----+-------------+
| id|name|expenses_date|
+---+----+-------------+
| 1|john| 20191203|
| 2| joe| 20191204|
+---+----+-------------+
EDIT
If the only tag that changes is the one directly under the expenses struct, then you can list all the fields under expenses and coalesce the columns expenses.X.details.date. Something like this in Spark (Scala):
val expenses_fields = df.select(col("expenses.*")).columns
val date_cols = expenses_fields.map(f => col(s"`expenses`.`$f`.`details`.`date`"))
df.select(col("id"), col("name"), coalesce(date_cols: _*).alias("expenses_date")).show()
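Since the question uses the Java API, the same column-path construction can be mirrored in plain Java. A sketch (the dateColumnPaths helper name is mine, not a Spark API; the field names would come from df.select(col("expenses.*")).columns() at runtime):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ExpenseDateColumns {
    // Build the fully qualified column paths `expenses`.`<X>`.`details`.`date`
    // for every field found under the expenses struct.
    static List<String> dateColumnPaths(String[] expensesFields) {
        return Arrays.stream(expensesFields)
                .map(f -> "`expenses`.`" + f + "`.`details`.`date`")
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String[] fields = {"travel", "food"};  // in Spark: df.select(col("expenses.*")).columns()
        dateColumnPaths(fields).forEach(System.out::println);
        // In Spark's Java API these paths would then be turned into Columns
        // and combined with functions.coalesce, roughly:
        //   Column[] cols = paths.stream().map(functions::col).toArray(Column[]::new);
        //   df.select(col("id"), col("name"), coalesce(cols).alias("expenses_date")).show();
    }
}
```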
Still, you don't need to use a UDF!
Upvotes: 1