Reputation: 39
I'm trying to gather sensor reads from a SCADA system. I can access this data through an API, but it is limited: each call returns data for only one sensor, so with thousands of sensors and each call taking 3-5 seconds, collecting everything I need takes far too long.
For this purpose I've created a Python script that loops over the sensors, downloads each sensor's data, and concatenates it into a DataFrame, but as I said, it takes too long. It looks like this:
import pandas as pd
import requests

session = requests.Session()
df_datasensors = pd.DataFrame()

for ninv in sensor_array:
    base_url = 'https://XXXXXXXXXXXXXXXXX'
    params = {
        'deviceid': ninv,
        'valueids': inv_data,
        'from': date_from,
        'to': date_to,
        'resolution': '15min',
    }
    headers = {'Authorization': 'Bearer ' + token}
    response = session.get(base_url, headers=headers, params=params)
    df_aux = pd.DataFrame(response.json())
    df_datasensors = pd.concat([df_datasensors, df_aux], axis=0)
Is there any way to improve this and get everything in less time? I've heard about multiprocessing, but I don't know how to implement it here.
Thanks!
Upvotes: 1
Views: 760
Reputation: 195633
This example will give you an idea of how to use multiprocessing.Pool with requests.get (the basic idea is to use the Pool to fetch the JSON data and, in the main process, construct a DataFrame from each result, append it to a list and, as a final step, concat all the DataFrames together):
import requests
import pandas as pd
from multiprocessing import Pool


def get_data(params):
    # unpack one work item: everything a single request needs
    ninv, inv_data, date_from, date_to, token = params

    base_url = "https://XXXXXXXXXXXXXXXXX"
    params = {
        "deviceid": ninv,
        "valueids": inv_data,
        "from": date_from,
        "to": date_to,
        "resolution": "15min",
    }
    headers = {
        "Authorization": "Bearer " + token,
    }

    # requests.get rather than the question's session.get: a Session
    # object can't be shared across worker processes
    response = requests.get(base_url, headers=headers, params=params)
    return response.json()
if __name__ == "__main__":
    # construct this data programmatically (from the sensor_array?)
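    # e.g., assuming the variables from the question, something like:
    # data = [(ninv, inv_data, date_from, date_to, token) for ninv in sensor_array]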
    data = [
        ("ninv1", "inv_data1", "date_from1", "date_to1", "tkn"),
        ("ninv2", "inv_data2", "date_from2", "date_to2", "tkn"),
        ("ninv3", "inv_data3", "date_from3", "date_to3", "tkn"),
        ("ninv4", "inv_data4", "date_from4", "date_to4", "tkn"),
        ("ninv5", "inv_data5", "date_from5", "date_to5", "tkn"),
    ]

    all_dfs = []
    with Pool(4) as pool:  # <-- 4 is the number of processes to use
        # imap_unordered yields each result as soon as a worker finishes
        for result in pool.imap_unordered(get_data, data):
            all_dfs.append(pd.DataFrame(result))

    df_final = pd.concat(all_dfs)
    print(df_final)
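Side note: the work here is I/O-bound (each worker mostly waits on an HTTP response), so a thread pool performs just as well as a process pool while avoiding process startup and argument pickling. multiprocessing.pool.ThreadPool exposes the same API as Pool; a minimal sketch, reusing get_data and data from above (the worker count of 8 is an arbitrary choice):
from multiprocessing.pool import ThreadPool  # same API as Pool, backed by threads

all_dfs = []
# I/O-bound workers mostly sleep, so more workers than CPU cores is fine;
# 8 is an arbitrary choice, tune it against what the API tolerates
with ThreadPool(8) as pool:
    for result in pool.imap_unordered(get_data, data):
        all_dfs.append(pd.DataFrame(result))

df_final = pd.concat(all_dfs)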
Upvotes: 1