Import a Python library in PySpark

Question

I'm pretty new to Python.

I would like to read some XML files from S3 and query them. I am connected to AWS and have spun up an EC2 cluster, but I am not sure how to import the libraries I need to get at the data.

I think it will work to convert the XML to JSON with the xmlutils library and then load the result with read.json on the SQLContext, which I do have access to (see below).

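from xmlutils.xml2json import xml2json
from pyspark.sql import SQLContext

# Convert the XML input to JSON (this runs on the driver)
converter = xml2json("S3 logs", "output.sql", encoding="utf-8")
converter.convert()

sqlContext = SQLContext(sc)

# Load the converted file and expose it to SQL as a temporary table
logs = sqlContext.read.json("output.sql")
logs.registerTempTable("logs")

query_results = sqlContext.sql("SELECT * from logs...")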
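
For reference, here is a minimal sketch of the XML-to-JSON step using only the standard library instead of xmlutils. By default, Spark's read.json expects one JSON object per line, so the sketch writes each record out that way. The function name, the sample XML and the `entry` record tag are hypothetical and only there for illustration; real S3 logs will have a different layout, and this version ignores attributes and nested elements.

```python
import json
import xml.etree.ElementTree as ET


def xml_records_to_json_lines(xml_text, record_tag):
    """Turn every <record_tag> element into one JSON object per line."""
    root = ET.fromstring(xml_text)
    lines = []
    for record in root.iter(record_tag):
        # Flatten the direct children into {tag: text}; attributes and
        # deeper nesting are ignored in this sketch.
        row = {child.tag: child.text for child in record}
        lines.append(json.dumps(row, sort_keys=True))
    return "\n".join(lines)


# Hypothetical sample input, not the real log format
sample = (
    "<logs>"
    "<entry><ip>10.0.0.1</ip><status>200</status></entry>"
    "<entry><ip>10.0.0.2</ip><status>404</status></entry>"
    "</logs>"
)

json_lines = xml_records_to_json_lines(sample, "entry")
print(json_lines)
```

Writing `json_lines` to a file would give something that read.json can load directly, assuming the flattened layout is good enough for the queries.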

EDIT

I am trying to use this block of code from the Cloudera link to get xmlutils imported in my virtual environment on Spark (SparkConf and SparkContext are already set up).

def import_my_special_package(x):
    import my.special.package
    return x

int_rdd = sc.parallelize([1, 2, 3, 4])
# map() is lazy and returns a new RDD, so collect() must be called on its result
mapped_rdd = int_rdd.map(lambda x: import_my_special_package(x))
mapped_rdd.collect()

I tried passing both xmlutils and 'xmlutils' to the function as the argument x, but it didn't work. Am I doing something wrong? Thanks.
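
To make the role of x concrete, here is a small sketch that runs without Spark. In the snippet above, x is just the RDD element that gets passed through unchanged; the package name belongs in the import statement inside the function body. Python's built-in map stands in for the RDD map here, and the standard-library json module stands in for xmlutils; both substitutions and the function name are assumptions made only so the example is self-contained.

```python
import importlib


def import_on_worker(x, module_name="json"):
    # The import sits inside the function body, so it executes wherever
    # the function itself executes (on a Spark executor, in the real case).
    importlib.import_module(module_name)
    # x is only the data element; it is returned untouched.
    return x


data = [1, 2, 3, 4]

# Built-in map is a local stand-in for rdd.map(...).collect()
result = list(map(import_on_worker, data))
print(result)
```

Note that an import like this does not install anything: the package still has to be present on every worker, otherwise the import raises ImportError there.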
