FelipePerezR

Reputation: 175

Parallel REST API request using Spark(Databricks)

I want to leverage Spark (it is running on Databricks and I am using PySpark) to send parallel requests to a REST API. Right now I might face two scenarios:

Any suggestions on how to distribute requests among nodes?

Thanks!

Upvotes: 7

Views: 8933

Answers (1)

Alex Ott

Reputation: 87069

Just create a dataframe with the URLs (if you use different ones) and the arguments for the API (if they aren't part of the URL). This can be done either by creating it explicitly from a list, etc., or by reading the data from an external source such as JSON files (via the spark.read functions).
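For the second option, a minimal sketch (the file path and the field names url/params are assumptions for illustration) could look like this, given a JSON Lines file with one {"url": ..., "params": ...} object per line:

# hypothetical input file; spark.read.json infers the url/params columns from the records
df = spark.read.json("/path/to/requests.json")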

And then you define a user-defined function (UDF) that performs the request to the REST API and returns the data as a column. Something like this (not tested):

import urllib.request

from pyspark.sql.functions import col, udf

df = spark.createDataFrame(
  [("url1", "params1"), ("url2", "params2")],
  ("url", "params"))

@udf("body string, status int")
def do_request(url: str, params: str):
  # perform the HTTP request; depending on the API, params could be appended
  # to the URL as a query string or sent in the request body
  with urllib.request.urlopen(url) as f:
    status = f.status
    body = f.read().decode("utf-8")
  return {'status': status, 'body': body}

res = df.withColumn("result", do_request(col("url"), col("params")))

This will return a dataframe with a new column called result that has two fields: status and body (the JSON response). You will need to add error handling, etc.
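For the error handling, one possible approach (a hedged sketch, not part of the original answer; do_request_safe is a hypothetical name) is to catch the urllib exceptions inside the UDF and return the failure information in the same struct, then pull the struct fields out into top-level columns:

import urllib.error
import urllib.request

from pyspark.sql.functions import col, udf

@udf("body string, status int")
def do_request_safe(url: str, params: str):
  try:
    with urllib.request.urlopen(url) as f:
      return {'status': f.status, 'body': f.read().decode("utf-8")}
  except urllib.error.HTTPError as e:
    # HTTP-level errors keep the server's status code
    return {'status': e.code, 'body': str(e)}
  except urllib.error.URLError as e:
    # network / DNS failures get a sentinel status
    return {'status': -1, 'body': str(e)}

res = df.withColumn("result", do_request_safe(col("url"), col("params")))
flat = res.select("url", col("result.status").alias("status"), col("result.body").alias("body"))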

Upvotes: 11
