Reputation: 11
I am reading an XML file in Scala:
<tag1>
<tag2 id="0" attr1="abc" ... />
..
</tag1>
This was already reported as an issue and closed: https://github.com/databricks/spark-xml/pull/303
However, I am not able to resolve it.
import org.apache.spark.sql.SparkSession
import com.databricks.spark.xml._
import org.apache.spark.sql.types.{StructType, StructField, DoubleType, StringType}
import org.apache.spark.sql.{Row, SaveMode}

object stack {
  def main(args: Array[String]) {
    val spark = SparkSession.builder.getOrCreate()

    val customSchema = StructType(Array(
      StructField("id", DoubleType, nullable = true),
      StructField("attr1", StringType, nullable = true),
      ...
      ...
    ))

    val df = spark.read
      .option("rowTag", "tag2")
      .format("com.databricks.spark.xml")
      .schema(customSchema)
      .load("dummy.xml")

    import spark.sql
    import spark.implicits._

    df.createOrReplaceTempView("temp1")
    sql("SELECT * from temp1 limit 5").show()
  }
}
However, df.show(5) displays no rows.
The resolution talks about using XmlInputFormat, which I have not tried; if someone can guide me, it would be helpful.
A similar approach works with a nested XML file:
<books>
<book> .. </book>
<name> abc </name>
</books>
I want to see the DataFrame populated with values, and later I want to read many XML files and join them in a SQL query.
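For that later join step, a minimal sketch of one way it could look; the file names file1.xml/file2.xml, the shared row tag, and the join key _id are hypothetical, assuming each file is read with the same spark-xml reader as above:

import org.apache.spark.sql.SparkSession

object JoinXmlFiles {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.getOrCreate()

    // Read two (hypothetical) XML files with the same spark-xml reader as above
    val df1 = spark.read
      .option("rowTag", "tag2")
      .format("com.databricks.spark.xml")
      .load("file1.xml")
    val df2 = spark.read
      .option("rowTag", "tag2")
      .format("com.databricks.spark.xml")
      .load("file2.xml")

    // Register temp views so the actual join stays in plain SQL,
    // matching the SELECT ... LIMIT pattern already used above
    df1.createOrReplaceTempView("t1")
    df2.createOrReplaceTempView("t2")
    spark.sql("SELECT * FROM t1 JOIN t2 ON t1._id = t2._id").show()
  }
}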
Upvotes: 1
Views: 1016
Reputation: 11
Thanks, Mikhail, for the guidance; the issue turned out to be very small. Sorry for not providing the actual XML record earlier, as the problem was in the attributes:
<users>
<row Id="-1" Reputation="1" ..... />
</users>
The attributes started with capital letters; when I made them lowercase, my solution started working (of course, I printed the schema before using it, as Mikhail suggested).
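For completeness, a sketch of what the corrected schema might look like; the field list is illustrative only, assuming the attributes were renamed to lowercase and keep the _ prefix described in Mikhail's answer:

import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

// Illustrative schema after the fix: attribute columns keep the "_" prefix
// and must match the (now lowercase) attribute names exactly
val customSchema = StructType(Array(
  StructField("_id", DoubleType, nullable = true),
  StructField("_reputation", StringType, nullable = true)
))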
Upvotes: 0
Reputation: 625
You need to add the _ prefix for attributes.
Data (dummy.xml):
<tag1>
<tag2 id="0" attr1="abc"/>
<tag2 id="1" attr1="abd" />
<tag2 id="2" attr1="abd" />
</tag1>
Solution:
package main

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

object Main extends App {
  val spark = SparkSession.builder.config("spark.master", "local").getOrCreate()

  val customSchema = StructType(Array(
    StructField("_id", DoubleType, nullable = true),
    StructField("_attr1", StringType, nullable = true)
  ))

  val df = spark.read
    .option("rowTag", "tag2")
    .format("com.databricks.spark.xml")
    .schema(customSchema)
    .load("dummy.xml")

  import spark.sql

  df.createOrReplaceTempView("temp1")
  sql("SELECT * from temp1 limit 5").show()
}
Result:
+---+------+
|_id|_attr1|
+---+------+
|0.0|   abc|
|1.0|   abd|
|2.0|   abd|
+---+------+
How I got it:
1) Comment out .schema(customSchema)
2) df.printSchema()

See also: Extracting tag attributes from xml using sparkxml
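Spelled out, that debugging step is just a schema-inference pass over the same file (reusing the spark session from the solution above); the attribute columns it reports, such as _id and _attr1, reveal the field names your custom schema has to match:

// Read dummy.xml again without the user-provided schema and let spark-xml infer it
val inferred = spark.read
  .option("rowTag", "tag2")
  .format("com.databricks.spark.xml")
  .load("dummy.xml")

// Shows the inferred column names (the "_"-prefixed attribute columns),
// which is how the required field names become visible
inferred.printSchema()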
PS: Sorry for my English
Upvotes: 1