Reputation: 69
I am trying to run spark-xml in my Jupyter notebook in order to read XML files with Spark.
from os import environ
environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.10:0.4.1 pyspark-shell'
I found out that this is the way to use it. But when I try to import com.databricks.spark.xml._, I get an error saying:
no module named "com"
Upvotes: 1
Views: 3104
Reputation: 640
As I see it, you are not able to load the XML file as-is using PySpark and the Databricks lib; this problem happens often. Try running this command from your terminal, or from your notebook as a shell command:
pyspark --packages com.databricks:spark-xml_2.11:0.4.1
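Once the package is on the classpath this way, you do not import it in Python (import com.databricks.spark.xml._ is Scala syntax, which is why Python complains about no module named "com"); you read the XML through the DataFrame reader instead. A minimal sketch, where the path and the rowTag value are assumptions to adapt to your data:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# "com.databricks.spark.xml" is spark-xml's DataFrame source;
# rowTag names the XML element that marks one row ("book" is a placeholder).
df = (spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "book")
      .load("/path/to/data/*.xml"))
df.printSchema()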
If that does not work, you can try this workaround: read your file as plain text and then parse it yourself.
# Define your parser function; flatMap passes it one Row holding a whole file.
import xml.etree.ElementTree as ET

def parse_xml(row):
    """
    Read the XML string from the row, parse and extract the elements,
    then return a list of lists (one inner list per record).
    """
    # NOTE: the 'record' tag is an assumption; adapt it to your XML structure.
    root = ET.fromstring(row[0])
    return [[field.text for field in rec] for rec in root.findall('record')]
# Read each file as text at the RDD level (wholetext=True keeps one record per file)
file_rdd = spark.read.text("/path/to/data/*.xml", wholetext=True).rdd
# parse xml tree, extract the records and transform to new RDD
records_rdd = file_rdd.flatMap(parse_xml)
# convert RDDs to DataFrame with the pre-defined schema
output_df = records_rdd.toDF(my_schema)
If .toDF() does not work, import spark.implicits._ (that applies to the Scala API; in PySpark, .toDF() is available once a SparkSession is active).
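For completeness, my_schema above has to match the lists your parser returns. A minimal sketch, assuming each record holds two string fields (the field names are placeholders):
from pyspark.sql.types import StructType, StructField, StringType

# Field names and types are assumptions; they must line up with the
# inner lists produced by parse_xml.
my_schema = StructType([
    StructField("id", StringType(), True),
    StructField("name", StringType(), True),
])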
Upvotes: 1