maks

Reputation: 11

Reading an XML file with self-closing tags and many attributes in Scala

I am reading an XML file in Scala:

<tag1>
  <tag2 id="0" attr1="abc" ... />
  ..
</tag1>

This was already reported as an issue and closed: https://github.com/databricks/spark-xml/pull/303

However, I am not able to resolve this.

import org.apache.spark.sql.SparkSession
import com.databricks.spark.xml._
import org.apache.spark.sql.types.{StructType, StructField, DoubleType,StringType}
import org.apache.spark.sql.{Row, SaveMode}

object stack {
  def main(args: Array[String]) {
    val spark = SparkSession.builder.getOrCreate()

    val customSchema = StructType(Array(
      StructField("id", DoubleType, nullable = true),
      StructField("attr1", StringType, nullable = true),
      ...
      ...
    ))  
    val df = spark.read
        .option("rowTag", "tag2")
        .format("com.databricks.spark.xml")
        .schema(customSchema)
        .load("dummy.xml")

    import spark.sql
    import spark.implicits._

    df.createOrReplaceTempView("temp1")
    sql("SELECT * from temp1 limit 5").show()
  }
}

However, df.show(5) displays no rows.

The resolution talks about using XmlInputFormat, which I have not tried; if someone can guide me, it would be helpful.
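For reference, the low-level route would look roughly like this: a hypothetical, untested sketch that reads raw record strings with spark-xml's XmlInputFormat via newAPIHadoopFile. The start/end markers and the file name are assumptions based on the sample above; with self-closing tags the end marker would have to be the "/>" sequence rather than a closing tag, which is exactly what the linked PR is about.

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.spark.sql.SparkSession
import com.databricks.spark.xml.XmlInputFormat

object RawXmlRead {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.config("spark.master", "local").getOrCreate()
    val conf = spark.sparkContext.hadoopConfiguration
    // Delimit each record on the <tag2 .../> element (assumed markers).
    conf.set(XmlInputFormat.START_TAG_KEY, "<tag2")
    conf.set(XmlInputFormat.END_TAG_KEY, "/>")
    val records = spark.sparkContext
      .newAPIHadoopFile("dummy.xml", classOf[XmlInputFormat],
        classOf[LongWritable], classOf[Text])
      .map { case (_, text) => text.toString }
    // One raw XML string per tag2 element; parse the attributes yourself.
    records.collect().foreach(println)
    spark.stop()
  }
}
```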

A similar solution works with a nested XML file:

<books>
  <book> .. </book>
  <name> abc </name>
</books>

I want to see the DataFrame populated with values. Later, I want to read many XML files and join them in a SQL query.

Upvotes: 1

Views: 1016

Answers (2)

maks

Reputation: 11

Thanks, Mikhail, for providing guidance; however, the issue was very small. Sorry for not providing the actual XML record earlier, as the issue was in the attributes.

<users>
    <row Id="-1" Reputation="1" ..... />
</users>

The attribute names started with capital letters; when I made them lowercase, my solution started working (of course, I printed the schema before using it, as suggested by Mikhail).
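Worth noting: XML attribute names are case-sensitive, so a schema field for id will never match an attribute written Id. A minimal, Spark-free sketch with the JDK's built-in DOM parser, using the record shown above, demonstrates the mismatch:

```scala
import java.io.ByteArrayInputStream
import javax.xml.parsers.DocumentBuilderFactory

object AttributeCase {
  def main(args: Array[String]): Unit = {
    val xml = """<users><row Id="-1" Reputation="1"/></users>"""
    val doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
      .parse(new ByteArrayInputStream(xml.getBytes("UTF-8")))
    val attrs = doc.getElementsByTagName("row").item(0).getAttributes
    // Attribute lookup is case-sensitive: "Id" is present, "id" is not.
    println(Option(attrs.getNamedItem("Id")).map(_.getNodeValue))  // Some(-1)
    println(Option(attrs.getNamedItem("id")))                      // None
  }
}
```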

Upvotes: 0

Mikhail Ionkin

Reputation: 625

You need to add a _ prefix to the attribute names in your schema (spark-xml's default attributePrefix).

Data (dummy.xml):

<tag1>
    <tag2 id="0" attr1="abc"/>
    <tag2 id="1" attr1="abd" />
    <tag2 id="2" attr1="abd" />
</tag1>

Solution:

package main

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

object Main extends App {
  val spark = SparkSession.builder.config("spark.master", "local").getOrCreate()

  val customSchema = StructType(Array(
    StructField("_id", DoubleType, nullable = true),
    StructField("_attr1", StringType, nullable = true)
  ))
  val df = spark.read
    .option("rowTag", "tag2")
    .format("com.databricks.spark.xml")
    .schema(customSchema)
    .load("dummy.xml")
  import spark.sql

  df.createOrReplaceTempView("temp1")
  sql("SELECT * from temp1 limit 5").show()
}

Result:

+---+------+
|_id|_attr1|
+---+------+
|0.0|   abc|
|1.0|   abd|
|2.0|   abd|
+---+------+

How I got it:

  1. Figured the problem was with the schema, because it works with child elements
  2. Removed (or commented out) the custom schema (// .schema(customSchema))
  3. Printed the schema that Spark resolves (df.printSchema())
  4. Found the fields I needed
  5. Created a new schema
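The steps above can be sketched as a small stand-alone program (assuming a local SparkSession and the same dummy.xml as in the solution):

```scala
import org.apache.spark.sql.SparkSession

object InferSchema extends App {
  val spark = SparkSession.builder.config("spark.master", "local").getOrCreate()
  // Step 2: no .schema(customSchema) — let spark-xml infer the schema instead.
  val inferred = spark.read
    .option("rowTag", "tag2")
    .format("com.databricks.spark.xml")
    .load("dummy.xml")
  inferred.printSchema()  // Step 3: prints the resolved names (e.g. _id, _attr1)
  // Steps 4-5: copy the resolved names/types into a new StructType
  // and pass it via .schema(...) once they match what you need.
  spark.stop()
}
```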

See also: Extracting tag attributes from xml using sparkxml

PS: Sorry for my English

Upvotes: 1
