Reputation: 175
I want to leverage Spark (it is running on Databricks and I am using PySpark) in order to send parallel requests to a REST API. Right now I might face two scenarios:
Any suggestions on how to distribute requests among nodes?
Thanks!
Upvotes: 7
Views: 8933
Reputation: 87069
Just create a dataframe with URLs (if you use different ones) and arguments for the API (if they aren't part of the URL). This could be done either by creating it explicitly from a list, etc., or by reading the data from an external data source, such as JSON files (via the spark.read functions).
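For example, a minimal sketch of the second option, assuming a hypothetical JSON Lines file requests.json where each record has url and params fields:
# read URL/params pairs from a hypothetical JSON Lines file,
# where each record looks like {"url": "...", "params": "..."}
df = spark.read.json("/path/to/requests.json")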
Then you define a user-defined function (UDF) that performs the request to the REST API and returns the data as a column. Something like this (not tested):
import urllib.request

from pyspark.sql.functions import col, udf

df = spark.createDataFrame(
    [("url1", "params1"), ("url2", "params2")],
    ("url", "params"))

@udf("body string, status int")
def do_request(url: str, params: str):
    # 'params' is passed through but not used here; append it to the
    # URL or request as needed
    with urllib.request.urlopen(url) as f:
        status = f.status
        body = f.read().decode("utf-8")
    return {'status': status, 'body': body}

res = df.withColumn("result", do_request(col("url"), col("params")))
This will return a dataframe with a new column called result that has two fields: status and body (the response body). You will need to add error handling, etc.
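For example, to pull the two fields out of the result struct into top-level columns (a minimal sketch based on the dataframe above):
# expand the result struct into separate status and body columns
res.select("url", "result.status", "result.body").show(truncate=False)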
Upvotes: 11