John Brown

Reputation: 123

Read XML using PySpark in Jupyter notebook

I am trying to read an XML file:

df = spark.read.format('com.databricks.spark.xml').load('/path/to/my.xml')

and I am getting the following error:

java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml

I've tried to:

There is an alternative solution for PySpark in the "Pyspark notes" section here: https://github.com/databricks/spark-xml, but I can't figure out how to read the DataFrame in order to pass it into the function ext_schema_of_xml_df.

So, what else should I do to read XML with PySpark in JupyterLab?

Upvotes: 2

Views: 4570

Answers (2)

Amith jb

Reputation: 1

Just change the jar_path variable to the following:

# SPARK_HOME here is assumed to be a Python variable holding your Spark install path
jar_path = f"file:///D:{SPARK_HOME}//jars//spark-xml_2.12-0.10.0.jar"
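For context, here's a minimal sketch of how a local jar like that can be attached when the session is built (the drive letter, jar file name, and the use of the spark.jars config are my assumptions, not part of this answer):

from pyspark.sql import SparkSession

# Assumed location - point this at the spark-xml jar under your own Spark install
jar_path = "file:///D:/spark/jars/spark-xml_2.12-0.10.0.jar"

spark = SparkSession.builder \
    .appName("XML_Import") \
    .config("spark.jars", jar_path) \
    .getOrCreate()

# rowTag is whatever element repeats once per record in your file
df = spark.read.format("com.databricks.spark.xml") \
    .option("rowTag", "row") \
    .load("/path/to/my.xml")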

Upvotes: 0

James_SO

Reputation: 1387

As you've surmised, the key is to get the package loaded so that PySpark will use it in your Jupyter context.

Start your notebook with your regular imports:

import pandas as pd
from pyspark.sql import SparkSession
import os

Before you instantiate your session, do:

os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.12:0.12.0 pyspark-shell'

Notes:

  • The first part of the package version has to match the version of Scala that your Spark was built with - you can find this out by running spark-submit --version from the command line, e.g.
$ spark-submit --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.2
      /_/
                        
Using Scala version 2.12.10, OpenJDK 64-Bit Server VM, 1.8.0_292
Branch HEAD
Compiled by user centos on 2021-02-16T06:09:22Z
Revision 648457905c4ea7d00e3d88048c63f360045f0714
Url https://gitbox.apache.org/repos/asf/spark.git
Type --help for more information.

  • The second part of the package version just has to be one that has been published for that Scala version - you can find the available versions here: https://github.com/databricks/spark-xml - so in my case, since my Spark was built with Scala 2.12, the package I needed was com.databricks:spark-xml_2.12:0.12.0
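If you'd rather check this from inside a running session, here's a quick sketch (it assumes an existing SparkSession called spark; the second line goes through py4j internals, so treat it as a convenience rather than a stable API):

print(spark.version)  # Spark version, e.g. 3.0.2
# Scala version the JVM side was built with, e.g. "version 2.12.10"
print(spark.sparkContext._jvm.scala.util.Properties.versionString())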

Now instantiate your session:

# Creates a session on a local master
sparkSesh = SparkSession.builder.appName("XML_Import") \
    .master("local[*]").getOrCreate()

Find a simple .xml file whose structure you know - in my case, I used the XML version of nmap output.

thisXML = "simple.xml"

The reason for that is so that you can provide appropriate values for 'rootTag' and 'rowTag' below:

someXSDF = sparkSesh.read.format('xml') \
        .option('rootTag', 'nmaprun') \
        .option('rowTag', 'host') \
        .load(thisXML)
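To make rootTag/rowTag concrete, here is a rough sketch using a tiny hand-written file in the shape being assumed (illustrative only, not real nmap output): the outermost element is the rootTag, and each repeating host element becomes one row.

# Hypothetical demo file - each <host> under <nmaprun> becomes one DataFrame row
demoXML = """<nmaprun>
  <host><address addr="10.0.0.1"/></host>
  <host><address addr="10.0.0.2"/></host>
</nmaprun>"""

with open("demo.xml", "w") as f:
    f.write(demoXML)

demoDF = sparkSesh.read.format('xml') \
        .option('rootTag', 'nmaprun') \
        .option('rowTag', 'host') \
        .load("demo.xml")
demoDF.show()  # two rows, one per <host> element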

If the file is small enough, you can just do a .toPandas() to review it:

someXSDF.toPandas()[["address", "ports"]][:5]
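If it's too big for toPandas(), the usual Spark-side checks work as well (a small sketch; the exact columns will depend on your XML):

someXSDF.printSchema()             # structure spark-xml inferred from the file
print(someXSDF.count())            # number of rows, i.e. one per rowTag element
someXSDF.show(5, truncate=False)   # preview a few rows without going through pandas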


Then close the session.

sparkSesh.stop()

Closing Notes:

  • if you want to test this outside of Jupyter, just go to the command line and run
pyspark --packages com.databricks:spark-xml_2.12:0.12.0

you should see it load up properly in the PySpark shell.

  • if the package version doesn't match up with the Scala version, you might get this error: "Exception: Java gateway process exited before sending its port number" - which is a pretty funny way to explain that a package version number is wrong
  • if you've loaded the wrong package for the version of Scala that was used to build your Spark, you'll likely get this error when you try to read the XML: py4j.protocol.Py4JJavaError: An error occurred while calling o43.load. : java.lang.NoClassDefFoundError: scala/Product$class
  • if the read seems to work but you get an empty dataframe, you probably specified the wrong root tag and/or row tag
  • if you need to support multiple read types (let's say you also needed to be able to read Avro files in the same notebook), you would list multiple packages with commas (no spaces) separating them, like so:
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.12:0.12.0,org.apache.spark:spark-avro_2.12:3.1.2 pyspark-shell'
  • My version info: Python 3.6.9, Spark 3.0.2

Upvotes: 4
