Reputation: 588
I have many csv files and I am trying to pass all the data that they contain into a database. For this reason, I found that I could use the glob library to iterate over all csv files in my folder. Following is the code I used:
import requests as req
import pandas as pd
import glob
import json
endpoint = "testEndpoint"
path = "test/*.csv"
for fname in glob.glob(path):
print(fname)
df = pd.read_csv(fname)
for index, row in df.iterrows():
#print(row['ID'], row['timestamp'], row['date'], row['time'],
# row['vltA'], row['curA'], row['pwrA'], row['rpwrA'], row['frq'])
print(row['timestamp'])
testjson = {"data":
{"installationid": row['ID'],
"active": row['pwrA'],
"reactive": row['rpwrA'],
"current": row['curA'],
"voltage": row['vltA'],
"frq": row['frq'],
}, "timestamp": row['timestamp']}
payload = {"payload": [testjson]}
json_data = json.dumps(payload)
response = req.post(
endpoint, data=json_data, headers=headers)
This code seems to work fine in the beginning. However, after some time it starts to become really slow (I noticed this because I print the timestamp as I upload the data) and eventually stops completely. What is the reason for this? Is something I am doing here really inefficient?
Upvotes: 0
Views: 156
Reputation: 149035
I can see 3 possible problems here:
read_csv
is fast, but it loads the content of a full file in memory. If the files are really large, you could exhaust the real memory and start using swap which has terrible performancesiterrows
: you seem to build a dataframe - meaning a data structure optimized for column wise access - to then access it by rows. This already is a bad idea and iterrows
is know to have terrible performances because it builds a Series per each rowWithout more info I can hardly say more, but IHMO the higher gain is to be found on database feeding so here in point 3. If nothing can be done on that point, of if further performance gain is required, I would try to replace pandas with the csv module which is row oriented and has a limited footprint because it only processes one line at a time whatever the file size.
Finally, and if it makes sense for your use case, I would try to use one thread for the reading of the csv file that would feed a queue and a pool of threads to send requests to the database. That should allow to gain the HTTP overhead. But beware, depending on the endpoint implementation it could not improve much if really the database access if the limiting factor.
Upvotes: 1