Reputation: 123
I am trying to read an XML file:
df = spark.read.format('com.databricks.spark.xml').load('/path/to/my.xml')
and getting the following error:
java.lang.ClassNotFoundException: Failed to find data source: com.databricks.spark.xml
I've tried the following:
Installing spark-xml with:
$SPARK_HOME/bin/spark-shell --packages com.databricks:spark-xml_2.12:0.10.0
Running Spark with config:
jar_path = f'{SPARK_HOME}/jars/spark-xml_2.12-0.10.0.jar'
spark = SparkSession.builder \
    .config(conf=conf) \
    .config("spark.jars", jar_path) \
    .config("spark.executor.extraClassPath", jar_path) \
    .config("spark.executor.extraLibrary", jar_path) \
    .config("spark.driver.extraClassPath", jar_path) \
    .appName('my_app') \
    .getOrCreate()
Setting env variables:
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.12:0.10.0 pyspark'
Downloading the jar file and putting it into SPARK_HOME/jars
Here: https://github.com/databricks/spark-xml there is an alternative solution for PySpark in the "Pyspark notes" section, but I can't figure out how to read the dataframe in order to pass it into the function ext_schema_of_xml_df.
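From the README it looks like ext_schema_of_xml_df expects a DataFrame with exactly one string column where every row is an XML fragment, so I assume the input would be built something like this (the column name and sample fragment are just my guesses, untested on my side):
# Build a one-column DataFrame of XML strings, which ext_schema_of_xml_df seems to expect
xml_rows = spark.createDataFrame([("<row><a>1</a></row>",)], ["payload"])
inferred_schema = ext_schema_of_xml_df(xml_rows.select("payload"))
but I'm not sure that's the intended way to go from a file on disk to that dataframe.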
So, what else should I do to read XML with PySpark in JupyterLab?
Upvotes: 2
Views: 4570
Reputation: 1
Just change the jar_path variable to an absolute file:// URI, e.g.:
jar_path = f"file:///D:/{SPARK_HOME}/jars/spark-xml_2.12-0.10.0.jar"
Upvotes: 0
Reputation: 1387
As you've surmised, the thing is to get the package loaded such that PySpark will use it in your context in Jupyter.
Start your notebook with your regular imports:
import pandas as pd
from pyspark.sql import SparkSession
import os
Before you instantiate your session, do:
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.12:0.12.0 pyspark-shell'
Notes: The Scala version of your Spark build matters here, so first check what it was compiled with:
$ spark-submit --version
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /___/ .__/\_,_/_/ /_/\_\   version 3.0.2
      /_/
Using Scala version 2.12.10, OpenJDK 64-Bit Server VM, 1.8.0_292
Branch HEAD
Compiled by user centos on 2021-02-16T06:09:22Z
Revision 648457905c4ea7d00e3d88048c63f360045f0714
Url https://gitbox.apache.org/repos/asf/spark.git
Type --help for more information.
The Scala suffix of the package coordinate has to match the Scala version your Spark was built with, and the spark-xml version has to be one that was published for that Scala version - you can find the available combinations here: https://github.com/databricks/spark-xml. So in my case, since I had Spark built with Scala 2.12, the package I needed was com.databricks:spark-xml_2.12:0.12.0
Now instantiate your session:
# Creates a session on a local master
sparkSesh = SparkSession.builder.appName("XML_Import") \
.master("local[*]").getOrCreate()
Find a simple .xml file whose structure you know - in my case I used the XML version of nmap output
thisXML = "simple.xml"
The reason for that is so that you can provide appropriate values for 'rootTag' and 'rowTag' below:
someXSDF = sparkSesh.read.format('xml') \
.option('rootTag', 'nmaprun') \
.option('rowTag', 'host') \
.load(thisXML)
If the file is small enough, you can just do a .toPandas() to review it:
someXSDF.toPandas()[["address", "ports"]][:5]
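If it's too big for that, the plain DataFrame API gives a quick look without pulling everything onto the driver; a small sketch using the same columns:
# Inspect the inferred schema and the first few rows directly in Spark
someXSDF.printSchema()
someXSDF.select("address", "ports").show(5, truncate=False)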
Then close the session.
sparkSesh.stop()
Closing Notes:
If you want to test the package quickly outside Jupyter, launch the PySpark shell with it directly:
pyspark --packages com.databricks:spark-xml_2.12:0.12.0
and you should see it load up properly in the PySpark shell.
If the package version doesn't exist, the session dies with
"Exception: Java gateway process exited before sending its port number"
which is a pretty funny way to explain that a package version number is wrong.
If the package's Scala suffix doesn't match your Spark's Scala version, the load fails with
py4j.protocol.Py4JJavaError: An error occurred while calling o43.load. : java.lang.NoClassDefFoundError: scala/Product$class
If you need more than one package, separate them with commas in PYSPARK_SUBMIT_ARGS:
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.12:0.12.0,org.apache.spark:spark-avro_2.12:3.1.2 pyspark-shell'
Upvotes: 4