OdiumPura

Reputation: 631

How to handle multiple http requests on Apache Airflow

I have a DB with about 90,000 products, and I need to update it with an API response for each product. The problem is that this API accepts only one product SKU per request, so I basically need to call it 90,000 times.

I know that Airflow has HTTP packages:

from airflow.providers.http.operators.http import SimpleHttpOperator
from airflow.providers.http.sensors.http import HttpSensor

And I can create a task using them like:

task_get_op = SimpleHttpOperator(
    task_id='get_op',
    method='GET',
    endpoint='get',  # appended to the base URL of the 'http_default' connection
    data={"param1": "value1", "param2": "value2"},
    headers={},
    dag=dag,
)

Upvotes: 1

Views: 2270

Answers (1)

Thom Bedford

Reputation: 387

For 90,000 calls, a set of Airflow tasks isn't the right tool for the job; it'll be quite slow and cumbersome. Airflow also does a fair amount of logging for each HttpOperator call, so it'll flood your logs.

If you really want to do this in Airflow with some parallelism, the best option I can suggest is splitting your 90k SKUs into, say, 9 sets of 10k calls and having a DAG with 9 tasks, each of which uses a PythonOperator to loop over its 10k SKUs (see the sketch below). These 9 tasks can then run in parallel, which should speed it up somewhat, but it still won't be great for performance.
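Here's a minimal sketch of that layout, assuming Airflow 2.x. The fetch_skus() and save_product_update() helpers and the https://api.example.com/products endpoint are hypothetical placeholders for your own setup:

from datetime import datetime

import requests
from airflow import DAG
from airflow.operators.python import PythonOperator

NUM_CHUNKS = 9

def fetch_skus():
    # Placeholder: replace with a query against your own products table.
    return [f"SKU-{n}" for n in range(90_000)]

def save_product_update(sku, payload):
    # Placeholder: replace with your own DB write.
    pass

def update_chunk(chunk_index, **_):
    # Each task takes every 9th SKU, so the 9 tasks cover all 90k between them.
    for sku in fetch_skus()[chunk_index::NUM_CHUNKS]:
        resp = requests.get(
            "https://api.example.com/products",  # hypothetical endpoint
            params={"sku": sku},
            timeout=30,
        )
        resp.raise_for_status()
        save_product_update(sku, resp.json())

with DAG(
    dag_id="bulk_product_update",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,  # trigger manually
    catchup=False,
) as dag:
    for i in range(NUM_CHUNKS):
        PythonOperator(
            task_id=f"update_chunk_{i}",
            python_callable=update_chunk,
            op_kwargs={"chunk_index": i},
        )

Bear in mind the actual concurrency is still capped by your executor settings (e.g. parallelism and the DAG's max_active_tasks), so make sure those allow at least 9 simultaneous tasks.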

I'd also suggest, instead of running an update for each row of data you have, checking whether there's a bulk API you can call for all products, potentially importing every update for every product into another table, then writing a query that does a single update in the DB (see the sketch below).
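As a hedged sketch of that final step, assuming Postgres, a hypothetical my_db connection, and hypothetical products / product_updates_staging tables and columns:

from airflow.providers.postgres.hooks.postgres import PostgresHook

def apply_bulk_update():
    hook = PostgresHook(postgres_conn_id="my_db")  # hypothetical connection id
    # One set-based UPDATE replaces 90,000 individual row updates.
    hook.run("""
        UPDATE products p
        SET price = s.price,
            stock = s.stock
        FROM product_updates_staging s
        WHERE p.sku = s.sku;
    """)

This runs as a single statement inside the database, which will almost always beat 90,000 round trips no matter how you parallelise them.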

Upvotes: 3
