Reputation: 69
I am trying to run spark-xml in my Jupyter notebook in order to read XML files with Spark.
from os import environ
environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.databricks:spark-xml_2.10:0.4.1 pyspark-shell'
I found out that this is the way to use it. But when I try to import com.databricks.spark.xml._, I get an error saying:
no module named "com"
Upvotes: 1
Views: 3104
Reputation: 640
As I see it, you are not able to load the XML file as-is using PySpark and the Databricks lib; this problem happens often. Try running this command from your terminal, or from your notebook as a shell command:
pyspark --packages com.databricks:spark-xml_2.11:0.4.1
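Once the package is on the classpath this way, you do not import it in Python (import com.databricks.spark.xml._ is Scala syntax, which is why Python complains about no module named "com"); you read the XML through the DataFrame reader instead. A minimal sketch, where the path and the rowTag value are assumptions to adapt to your data:
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# "com.databricks.spark.xml" is spark-xml's DataFrame source;
# rowTag names the XML element that marks one row ("book" is a placeholder).
df = (spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "book")
      .load("/path/to/data/*.xml"))
df.printSchema()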
If that does not work, you can try this workaround: read your file as plain text and then parse it yourself.
# Define your parser function; flatMap passes it one Row holding a whole file.
import xml.etree.ElementTree as ET

def parse_xml(row):
    """
    Read the XML string from the row, parse and extract the elements,
    then return a list of lists (one inner list per record).
    """
    # NOTE: the 'record' tag is an assumption; adapt it to your XML structure.
    root = ET.fromstring(row[0])
    return [[field.text for field in rec] for rec in root.findall('record')]
# Read each file as text at the RDD level (wholetext=True keeps one record per file)
file_rdd = spark.read.text("/path/to/data/*.xml", wholetext=True).rdd
# parse xml tree, extract the records and transform to new RDD
records_rdd = file_rdd.flatMap(parse_xml)
# convert RDDs to DataFrame with the pre-defined schema
output_df = records_rdd.toDF(my_schema)
If .toDF() does not work, import spark.implicits._ (that applies to the Scala API; in PySpark, .toDF() is available once a SparkSession is active).
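For completeness, my_schema above has to match the lists your parser returns. A minimal sketch, assuming each record holds two string fields (the field names are placeholders):
from pyspark.sql.types import StructType, StructField, StringType

# Field names and types are assumptions; they must line up with the
# inner lists produced by parse_xml.
my_schema = StructType([
    StructField("id", StringType(), True),
    StructField("name", StringType(), True),
])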
Upvotes: 1