Reputation: 43
I am new to creating schemas for XML. Previously I used XSD to parse XML data.
I am trying to use the Spark read format method, but I don't see the seller id in the resulting schema. Is there a way to get both seller_id and trade_id into my data?
df_trade_loan = spark.read.format("com.databricks.spark.xml").option("rowTag","trade").option("rootTag","loan").load("dbfs:/FileStore/shared_uploads/trades/*")
My XML file looks like this:
<loan>
<seller>
<id>11</id>
</seller>
<trade id="67" type="Standard">
<advance>
<date>2011-03-09</date>
<amount>16466.76</amount>
<amount_gbp>16466.76</amount_gbp>
<percentage>90.0</percentage>
</advance>
<discount>
<percentage>1.0</percentage>
<on>Facevalue</on>
</discount>
<expected_payment_date>2011-03-18 00:00:00 +0000</expected_payment_date>
<settlement_date>2011-03-25</settlement_date>
<arrears>
<in_arrears>No</in_arrears>
<in_arrears_on_date>nan</in_arrears_on_date>
</arrears>
<payment>
<state>Paid</state>
</payment>
<price_grade>6</price_grade>
<currency>GBP</currency>
<face_value>
<amount>18296.4</amount>
<amount_gbp>18296.4</amount_gbp>
</face_value>
<outstanding_principal>
<amount>0.0</amount>
<amount_gbp>0.0</amount_gbp>
</outstanding_principal>
<crystalised_loss>
<amount>nan</amount>
<date>nan</date>
</crystalised_loss>
<gross_yield>
<annualised>14.164038846995776</annualised>
</gross_yield>
</trade>
</loan>
The current schema looks like this:
root
|-- _id: long (nullable = true)
|-- _type: string (nullable = true)
|-- advance: struct (nullable = true)
| |-- amount: double (nullable = true)
| |-- amount_gbp: double (nullable = true)
| |-- date: string (nullable = true)
| |-- percentage: double (nullable = true)
|-- arrears: struct (nullable = true)
| |-- in_arrears: string (nullable = true)
| |-- in_arrears_on_date: string (nullable = true)
|-- crystalised_loss: struct (nullable = true)
| |-- amount: string (nullable = true)
| |-- date: string (nullable = true)
|-- currency: string (nullable = true)
|-- discount: struct (nullable = true)
| |-- on: string (nullable = true)
| |-- percentage: double (nullable = true)
|-- expected_payment_date: string (nullable = true)
|-- face_value: struct (nullable = true)
| |-- amount: double (nullable = true)
| |-- amount_gbp: double (nullable = true)
|-- gross_yield: struct (nullable = true)
| |-- annualised: double (nullable = true)
|-- outstanding_principal: struct (nullable = true)
| |-- amount: double (nullable = true)
| |-- amount_gbp: double (nullable = true)
|-- payment: struct (nullable = true)
| |-- state: string (nullable = true)
|-- price_grade: long (nullable = true)
|-- settlement_date: string (nullable = true)
Upvotes: 0
Views: 365
Reputation: 43
df_trade_seller = spark.read.format("com.databricks.spark.xml").option("rowTag","loan").option("rootTag","seller").load("adl://haaldatalake.azuredatalakestore.net/use_cases/recommendation/tempsubas/tempsubas/trades/*")
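This code worked. With rowTag set to "loan", each row spans the whole <loan> element, so the schema contains both seller and trade instead of only the fields inside <trade>.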
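To illustrate why the choice of row tag decides whether the seller shows up, here is a small stand-alone sketch. It deliberately uses Python's standard `xml.etree.ElementTree` rather than spark-xml, so it only mimics the idea and is not the Spark API. The helper names `visible_fields` and `seller_and_trade_ids` and the trimmed `SAMPLE` document are made up for this example; the `_` prefix on attributes simply mirrors what the schema in the question shows.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample document from the question (illustrative only).
SAMPLE = """
<loan>
  <seller><id>11</id></seller>
  <trade id="67" type="Standard">
    <currency>GBP</currency>
  </trade>
</loan>
"""


def visible_fields(xml_text, row_tag):
    """For each <row_tag> element, list the fields found inside that element only.

    Attributes get a "_" prefix, similar to the _id/_type columns in the question.
    """
    root = ET.fromstring(xml_text)
    rows = []
    for elem in root.iter(row_tag):  # iter() also matches the root element itself
        fields = ["_" + name for name in elem.attrib]
        fields += [child.tag for child in elem]
        rows.append(fields)
    return rows


def seller_and_trade_ids(xml_text):
    """Pair the seller id with every trade id found under the same <loan>."""
    root = ET.fromstring(xml_text)
    pairs = []
    for loan in root.iter("loan"):
        seller_id = loan.findtext("seller/id")
        for trade in loan.findall("trade"):
            pairs.append((seller_id, trade.get("id")))
    return pairs


if __name__ == "__main__":
    # Row = <trade>: the seller sits outside this element, so it is not visible.
    print(visible_fields(SAMPLE, "trade"))
    # Row = <loan>: both seller and trade are children of the row element.
    print(visible_fields(SAMPLE, "loan"))
    print(seller_and_trade_ids(SAMPLE))
```

In the Spark DataFrame read with rowTag "loan", the same two values should be reachable as the nested fields seller.id and trade._id, assuming each loan holds a single trade element; with several trades per loan the trade column would likely be an array and need exploding first.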
Upvotes: 1