Reputation: 7567
I have a Pandas dataframe and I want to call an API and pass some parameters from that dataframe. Then I get the results from the API and create a new column from that. This is my working code:
import http.client, urllib.request, urllib.parse, urllib.error, base64
import pandas as pd
import json

headers = {
    # Request headers
    'Content-Type': 'application/json',
    'Ocp-Apim-Subscription-Key': 'my-api-key-goes-here',
}
params = urllib.parse.urlencode({
})
df = pd.read_csv('mydata.csv', names=['id', 'text'])

def call_api(row):
    try:
        body = {
            "documents": [
                {
                    "language": "en",
                    "id": row['id'],
                    "text": row['text']
                }
            ]
        }
        conn = http.client.HTTPSConnection('api-url')
        # json.dumps produces valid JSON; str(body) would send a Python repr
        conn.request("POST", "api-endpoint" % params, json.dumps(body), headers)
        response = conn.getresponse()
        data = response.read()
        # Close before returning; a close() placed after return never runs
        conn.close()
        data = json.loads(data)
        return data['documents'][0]['score']
    except Exception as e:
        # Not every exception has errno/strerror, so print the exception itself
        print("API call failed: {0}".format(e))

df['score'] = df.apply(call_api, axis=1)
The above works quite well. However, I am limited in the number of API requests I can make, and the API lets me send up to 100 documents in the same request by adding more entries to the body['documents'] list.
The returned data follow this schema:
{
    "documents": [
        {
            "score": 0.92,
            "id": "1"
        },
        {
            "score": 0.85,
            "id": "2"
        },
        {
            "score": 0.34,
            "id": "3"
        }
    ],
    "errors": null
}
So, what I am looking for is a way to make the same API call not row by row, but in batches of 100 rows at a time. Is there any way to do this in Pandas, or should I iterate over the dataframe rows, create the batches myself, and then iterate again to add the returned values to the new column?
Reputation: 249103
DataFrame.apply() is slow; we can do better. This will create the "documents" list-of-dicts in one go:
df.to_dict('records')
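For illustration, here is roughly what that call produces on a small frame. The sample rows are made up, and the "language" key is added by hand as an assumption: the request body in the question includes it, but the dataframe has no such column.

```python
import pandas as pd

# Hypothetical sample data standing in for mydata.csv
df = pd.DataFrame({"id": [1, 2], "text": ["great product", "not for me"]})

# One dict per row, keyed by column name
documents = df.to_dict("records")

# The question's request body also carries a "language" field per document,
# so add it as a constant column before converting (assumes all text is English)
documents_with_lang = df.assign(language="en").to_dict("records")

print(documents)
print(documents_with_lang[0])
```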
Then all you need to do is split it into chunks of 100:
start = 0
while start < len(df):
    documents = df.iloc[start:start+100].to_dict('records')
    call_api(documents)
    start += 100
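The loop above does not yet put the scores back into the dataframe. One possible way to do that is sketched below. It assumes call_api takes a list of document dicts and returns the parsed JSON in the schema shown in the question. The names score_in_batches and fake_call_api are hypothetical, and fake_call_api only stands in for the real API so the sketch runs offline.

```python
import pandas as pd

BATCH_SIZE = 100  # per-request document limit stated in the question


def score_in_batches(df, call_api, batch_size=BATCH_SIZE):
    """Return a Series of scores aligned to df's index, fetched batch by batch.

    call_api is assumed to take a list of document dicts and return the
    parsed response: {"documents": [{"id": ..., "score": ...}, ...]}.
    """
    scores = {}
    for start in range(0, len(df), batch_size):
        batch = df.iloc[start:start + batch_size]
        response = call_api(batch.assign(language="en").to_dict("records"))
        for doc in response["documents"]:
            scores[str(doc["id"])] = doc["score"]
    # Match on id rather than position, so any id missing from the
    # responses ends up as NaN instead of shifting the other scores
    return df["id"].astype(str).map(scores)


def fake_call_api(documents):
    # Stand-in for the real API: the "score" is just the text length
    return {
        "documents": [{"id": str(d["id"]), "score": len(d["text"])} for d in documents],
        "errors": None,
    }


df = pd.DataFrame({"id": range(250), "text": ["x" * (i % 5) for i in range(250)]})
df["score"] = score_in_batches(df, fake_call_api)
```

With 250 rows this makes three calls (100, 100 and 50 documents) instead of 250.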
Finally, you could use a single HTTP session with the requests library:
import requests
session = requests.Session()
call_api(session, documents)
Then inside call_api() you do session.post(...). This is more efficient than setting up a new connection each time.
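A minimal sketch of what such a call_api() might look like follows. The url and headers parameters are assumptions added for illustration (the question only gives placeholders for the endpoint and key), and returning an id-to-score dict is one possible choice rather than anything the API requires.

```python
def call_api(session, documents, url, headers):
    """POST one batch of documents and return a dict mapping id to score.

    session is expected to be a requests.Session; url and headers are
    placeholders for the real endpoint and subscription key.
    """
    # json= serializes the body as JSON; the session reuses the underlying
    # connection across calls where it can
    response = session.post(url, json={"documents": documents},
                            headers=headers, timeout=30)
    response.raise_for_status()  # raise on HTTP errors instead of parsing an error body
    return {doc["id"]: doc["score"] for doc in response.json()["documents"]}


# Usage (requires the requests package; the URL is a placeholder):
#   import requests
#   session = requests.Session()
#   scores = call_api(session, documents, "https://<api-url>/<api-endpoint>", headers)
```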