codebee

Reputation: 844

pyspark.sql to JSON

What's the best way to get a valid JSON document from a pyspark.sql query? For example, if I run:

spark.sql("show tables")

I see that it returns a DataFrame, and I can call toJSON on that to get an RDD of JSON strings. But I can't find a good way to turn that into a single JSON document. I called collect() on it, but that doesn't return valid JSON, and it doesn't seem to be the most efficient way to convert a DataFrame to JSON.
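For illustration (a sketch; the exact row contents depend on the catalog), toJSON produces an RDD of per-row JSON strings, so collect() hands back a Python list of strings rather than one JSON document:

    rows = spark.sql("show tables").toJSON().collect()
    # rows is a list of JSON strings, one per table, e.g.:
    # ['{"database":"default","tableName":"t1","isTemporary":false}', ...]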

Upvotes: 0

Views: 605

Answers (2)

codebee

Reputation: 844

I got the JSON response using the method below:

def exec_spark_sql_query(query):
    # toJSON() yields one JSON string per row; join them into an array.
    # ','.join also handles an empty result, returning '[]' instead of ''.
    json_rdd = spark.sql(query).toJSON()
    rdd_list = json_rdd.collect()
    return '[' + ','.join(rdd_list) + ']'

I know this is definitely not the best way to get a JSON response for a Python API call that executes an SQL query through Spark SQL, but it gets the job done.
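A more defensive variant (a sketch, assuming the full result fits in driver memory) is to round-trip the collected strings through Python's json module, which guarantees the final document parses:

    import json

    def exec_spark_sql_query(query):
        # Each collected element is already a JSON object string;
        # parsing and re-serialising yields one well-formed JSON array.
        rows = spark.sql(query).toJSON().collect()
        return json.dumps([json.loads(r) for r in rows])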

Upvotes: 0

cylim

Reputation: 542

One way to do it is to write the JSON RDD as text files. The JSON will be correctly formatted.

df.toJSON().saveAsTextFile("/tmp/jsonRecords")

Note that this will write one file per partition, so you may need to concatenate them manually.

The approach is adapted from the answer here, which uses Scala.
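If a single file is wanted, one workaround (a sketch; coalescing to one partition funnels all rows through a single task, so it only suits modestly sized results) is to reduce the RDD to one partition before writing:

    df.toJSON().coalesce(1).saveAsTextFile("/tmp/jsonRecords")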

Upvotes: 1
