Reputation: 51
I'm using Spark in Java to process XML files. The spark-xml package from Databricks is used to read the XML files into a DataFrame.
The example xml files are:
<RowTag>
<id>1</id>
<name>john</name>
<expenses>
<travel>
<details>
<date>20191203</date>
<amount>400</amount>
</details>
</travel>
</expenses>
</RowTag>
<RowTag>
<id>2</id>
<name>joe</name>
<expenses>
<food>
<details>
<date>20191204</date>
<amount>500</amount>
</details>
</food>
</expenses>
</RowTag>
The resulting Spark Dataset<Row> df is shown below; each row represents one XML file.
+--+------+----------------+
|id| name |expenses        |
+--+------+----------------+
|1 | john |[[20191203,400]]|
|2 | joe  |[[20191204,500]]|
+--+------+----------------+
df.printSchema();
shows:
root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- expenses: struct (nullable = true)
 |    |-- travel: struct (nullable = true)
 |    |    |-- details: struct (nullable = true)
 |    |    |    |-- date: string (nullable = true)
 |    |    |    |-- amount: integer (nullable = true)
 |    |-- food: struct (nullable = true)
 |    |    |-- details: struct (nullable = true)
 |    |    |    |-- date: string (nullable = true)
 |    |    |    |-- amount: integer (nullable = true)
The desired output dataframe is like:
+--+------+-------------+
|id| name |expenses_date|
+--+------+-------------+
|1 | john |20191203     |
|2 | joe  |20191204     |
+--+------+-------------+
Basically, I want a generic solution that gets the date from any XML file with the following structure, in which only the tag <X> differs.
<RowTag>
<id>1</id>
<name>john</name>
<expenses>
<X>
<details>
<date>20191203</date>
<amount>400</amount>
</details>
</X>
</expenses>
</RowTag>
What I have tried:
spark.udf().register("getDate", (UDF1<Row, String>) (Row row) -> {
    return row.getStruct(0).getStruct(0).getAs("date").toString();
}, DataTypes.StringType);

df.select(callUDF("getDate", df.col("expenses")).as("expenses_date")).show();
But it didn't work, because row.getStruct(0) routes to <travel>, and for row joe there is no <travel> tag under <expenses>, so it threw a java.lang.NullPointerException. What I want is a generic solution that, for each row, automatically picks whichever child tag is present, e.g. row.getStruct(0) routes to <travel> for row john and to <food> for row joe.
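In other words, the lookup should fall back across whichever child structs of expenses actually exist and descend into details.date on the first non-null one. A plain-Java sketch of that idea (no Spark here; nested Maps stand in for Spark Rows, and the firstDate helper name is mine):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class FirstNonNullDate {
    // Scan the children of `expenses` in schema order and return
    // `details`.`date` from the first child struct that is present.
    @SuppressWarnings("unchecked")
    static String firstDate(Map<String, Object> expenses) {
        for (Object child : expenses.values()) {
            if (child == null) continue;  // this tag is absent for this row
            Map<String, Object> details =
                    (Map<String, Object>) ((Map<String, Object>) child).get("details");
            Object date = details.get("date");
            if (date != null) return date.toString();
        }
        return null;  // no matching tag at all
    }

    public static void main(String[] args) {
        // Row "joe": no <travel>, only <food>
        Map<String, Object> details = new LinkedHashMap<>();
        details.put("date", "20191204");
        details.put("amount", 500);
        Map<String, Object> food = new LinkedHashMap<>();
        food.put("details", details);
        Map<String, Object> joe = new LinkedHashMap<>();
        joe.put("travel", null);
        joe.put("food", food);
        System.out.println(firstDate(joe));  // prints 20191204
    }
}
```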
So my question is: how should I reformulate my UDF to achieve this?
Thanks in advance!! :)
Upvotes: 0
Views: 644
Reputation: 32680
The spark-xml package allows you to access nested fields directly in the select expression. Why are you looking for a UDF?
df.selectExpr("id", "name", "COALESCE(`expenses`.`food`.`details`.`date`, `expenses`.`travel`.`details`.`date`) AS expenses_date").show()
Output:
+---+----+-------------+
| id|name|expenses_date|
+---+----+-------------+
| 1|john| 20191203|
| 2| joe| 20191204|
+---+----+-------------+
EDIT
If the only tag that changes is the one directly under the expenses struct, then you can list all the fields under expenses and coalesce the columns expenses.X.details.date. Something like this in Spark (Scala):
val expenses_fields = df.select(col("expenses.*")).columns
val date_cols = expenses_fields.map(f => col(s"`expenses`.`$f`.`details`.`date`"))
df.select(col("id"), col("name"), coalesce(date_cols: _*).alias("expenses_date")).show()
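Since the question uses the Java API, the same column-path construction can be mirrored in plain Java. A sketch (the dateColumnPaths helper name is mine, not a Spark API; the field names would come from df.select(col("expenses.*")).columns() at runtime):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ExpenseDateColumns {
    // Build the fully qualified column paths `expenses`.`<X>`.`details`.`date`
    // for every field found under the expenses struct.
    static List<String> dateColumnPaths(String[] expensesFields) {
        return Arrays.stream(expensesFields)
                .map(f -> "`expenses`.`" + f + "`.`details`.`date`")
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        String[] fields = {"travel", "food"};  // in Spark: df.select(col("expenses.*")).columns()
        dateColumnPaths(fields).forEach(System.out::println);
        // In Spark's Java API these paths would then be turned into Columns
        // and combined with functions.coalesce, roughly:
        //   Column[] cols = paths.stream().map(functions::col).toArray(Column[]::new);
        //   df.select(col("id"), col("name"), coalesce(cols).alias("expenses_date")).show();
    }
}
```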
Still, you don't need to use a UDF!
Upvotes: 1