Reputation: 39
I'm trying to gather sensor reads from a SCADA system. I can access this data through an API, but it is limited: each call returns data for only one sensor, so with thousands of sensors and each call taking 3-5 seconds, collecting everything I need takes far too long.
For this purpose I've created a Python script that loops over the sensors, downloads each sensor's data, and concatenates it into a DataFrame, but as I said, it takes too long. It looks like this:
import pandas as pd
import requests

session = requests.Session()
df_datasensors = pd.DataFrame()

for ninv in sensor_array:
    base_url = 'https://XXXXXXXXXXXXXXXXX'
    params = {
        'deviceid': ninv,
        'valueids': inv_data,
        'from': date_from,
        'to': date_to,
        'resolution': '15min',
    }
    headers = {'Authorization': 'Bearer ' + token}
    response = session.get(base_url, headers=headers, params=params)
    df_aux = pd.DataFrame(response.json())
    df_datasensors = pd.concat([df_datasensors, df_aux], axis=0)
Is there any way to improve this and get everything in less time? I've heard about multiprocessing, but I don't know how to implement it here.
Thanks!
Upvotes: 1
Views: 760
Reputation: 195633
This example will give you an idea of how to use multiprocessing.Pool with requests.get (the basic idea is to use the Pool to fetch the JSON data and, in the main process, construct a DataFrame from each result, append it to a list and, as a final step, concat all the DataFrames together):
import requests
import pandas as pd
from multiprocessing import Pool


def get_data(params):
    # unpack one work item: everything a single request needs
    ninv, inv_data, date_from, date_to, token = params

    base_url = "https://XXXXXXXXXXXXXXXXX"
    params = {
        "deviceid": ninv,
        "valueids": inv_data,
        "from": date_from,
        "to": date_to,
        "resolution": "15min",
    }
    headers = {
        "Authorization": "Bearer " + token,
    }

    # requests.get rather than the question's session.get: a Session
    # object can't be shared across worker processes
    response = requests.get(base_url, headers=headers, params=params)
    return response.json()
if __name__ == "__main__":
    # construct this data programmatically (from the sensor_array?)
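    # e.g., assuming the variables from the question, something like:
    # data = [(ninv, inv_data, date_from, date_to, token) for ninv in sensor_array]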
    data = [
        ("ninv1", "inv_data1", "date_from1", "date_to1", "tkn"),
        ("ninv2", "inv_data2", "date_from2", "date_to2", "tkn"),
        ("ninv3", "inv_data3", "date_from3", "date_to3", "tkn"),
        ("ninv4", "inv_data4", "date_from4", "date_to4", "tkn"),
        ("ninv5", "inv_data5", "date_from5", "date_to5", "tkn"),
    ]

    all_dfs = []
    with Pool(4) as pool:  # <-- 4 is the number of processes to use
        # imap_unordered yields each result as soon as a worker finishes
        for result in pool.imap_unordered(get_data, data):
            all_dfs.append(pd.DataFrame(result))

    df_final = pd.concat(all_dfs)
    print(df_final)
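Side note: the work here is I/O-bound (each worker mostly waits on an HTTP response), so a thread pool performs just as well as a process pool while avoiding process startup and argument pickling. multiprocessing.pool.ThreadPool exposes the same API as Pool; a minimal sketch, reusing get_data and data from above (the worker count of 8 is an arbitrary choice):
from multiprocessing.pool import ThreadPool  # same API as Pool, backed by threads

all_dfs = []
# I/O-bound workers mostly sleep, so more workers than CPU cores is fine;
# 8 is an arbitrary choice, tune it against what the API tolerates
with ThreadPool(8) as pool:
    for result in pool.imap_unordered(get_data, data):
        all_dfs.append(pd.DataFrame(result))

df_final = pd.concat(all_dfs)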
Upvotes: 1