Reputation: 844
What's the best way to get a valid JSON document from a pyspark.sql query? For example, if I run:
spark.sql("show tables")
I see that it returns a DataFrame, and I can call 'toJSON' on that to get an RDD. But I'm not able to find a good way to convert that into a JSON document. I called 'collect()' on it, but that doesn't return valid JSON and doesn't seem like the most efficient way to convert a DataFrame to JSON.
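Roughly what I tried (a sketch, assuming an existing SparkSession named spark):

df = spark.sql("show tables")
json_rdd = df.toJSON()       # RDD with one JSON string per row
rows = json_rdd.collect()    # a Python list of strings, not a single JSON document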
Upvotes: 0
Views: 605
Reputation: 844
I got the JSON response using the method below:
def exec_spark_sql_query(query):
    # Run the query and serialize each row as a JSON string
    json_rdd = spark.sql(query).toJSON()
    rdd_list = json_rdd.collect()
    output_json = ''
    # Stitch the per-row JSON strings into a single JSON array
    for i, entry in enumerate(rdd_list):
        if i == 0:
            output_json = output_json + '['
        if i == len(rdd_list) - 1:
            return output_json + entry + ']'
        output_json = output_json + entry + ','
    return output_json
I know this is definitely not the best way to get a JSON response for a Python API call that executes a SQL query using Spark SQL, but it gets the job done.
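A shorter version of the same idea joins the per-row JSON strings directly (a minimal sketch, assuming the same spark session; an empty result yields '[]'):

def exec_spark_sql_query(query):
    # Each collected element is already one row serialized as a JSON string
    rows = spark.sql(query).toJSON().collect()
    # Join the rows and wrap them in brackets to form a JSON array document
    return '[' + ','.join(rows) + ']'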
Upvotes: 0
Reputation: 542
One way to do it is to write the JSON RDD as text files. The JSON will be correctly formatted.
df.toJSON().saveAsTextFile("/tmp/jsonRecords")
Note that this writes one file per partition, so you may need to concatenate them manually.
The approach is adapted from the answer here, which uses Scala.
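If a single output file is acceptable for a small result, coalescing to one partition before writing avoids the manual concatenation (a sketch; the output paths are examples):

# Collapse to a single partition so only one part file is written
df.toJSON().coalesce(1).saveAsTextFile("/tmp/jsonRecords")

# Equivalent using the DataFrame writer
df.coalesce(1).write.json("/tmp/jsonRecords2")

Both produce one JSON object per line (JSON Lines) rather than a single JSON array.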
Upvotes: 1