Reputation: 844
What's the best way to get a valid JSON document from a pyspark.sql query? For example, if I run:
spark.sql("show tables")
I see that it returns a DataFrame, and I can call 'toJSON' on that to get an RDD. But I'm not able to find a good way to convert that into a JSON document. I called 'collect()' on it, but that doesn't return valid JSON and doesn't seem like the most efficient way to convert a DataFrame to JSON.
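Roughly what I tried (a sketch, assuming an existing SparkSession named spark):

df = spark.sql("show tables")
json_rdd = df.toJSON()       # RDD with one JSON string per row
rows = json_rdd.collect()    # a Python list of strings, not a single JSON document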
Upvotes: 0
Views: 605
Reputation: 844
I got the JSON response using the method below:
def exec_spark_sql_query(query):
    # Run the query and serialize each row as a JSON string
    json_rdd = spark.sql(query).toJSON()
    rdd_list = json_rdd.collect()
    output_json = ''
    # Stitch the per-row JSON strings into a single JSON array
    for i, entry in enumerate(rdd_list):
        if i == 0:
            output_json = output_json + '['
        if i == len(rdd_list) - 1:
            return output_json + entry + ']'
        output_json = output_json + entry + ','
    return output_json
I know this is definitely not the best way to get a JSON response for a Python API call that executes a SQL query using Spark SQL, but it gets the job done.
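A shorter version of the same idea joins the per-row JSON strings directly (a minimal sketch, assuming the same spark session; an empty result yields '[]'):

def exec_spark_sql_query(query):
    # Each collected element is already one row serialized as a JSON string
    rows = spark.sql(query).toJSON().collect()
    # Join the rows and wrap them in brackets to form a JSON array document
    return '[' + ','.join(rows) + ']'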
Upvotes: 0
Reputation: 542
One way to do it is to write the JSON RDD as text files. The JSON will be correctly formatted.
df.toJSON().saveAsTextFile("/tmp/jsonRecords")
Note that this writes one file per partition, so you may need to concatenate them manually.
The approach is adapted from the answer here, which uses Scala.
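If a single output file is acceptable for a small result, coalescing to one partition before writing avoids the manual concatenation (a sketch; the output paths are examples):

# Collapse to a single partition so only one part file is written
df.toJSON().coalesce(1).saveAsTextFile("/tmp/jsonRecords")

# Equivalent using the DataFrame writer
df.coalesce(1).write.json("/tmp/jsonRecords2")

Both produce one JSON object per line (JSON Lines) rather than a single JSON array.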
Upvotes: 1