Reputation: 2253
With requests I am calling an API as follows:
import json
import requests

def foo(input):
    payload = {'key': '', 'in': input, 'fj': 'm'}
    r = requests.get('https://api.example.com/api', params=payload)
    res = json.loads(r.text)
    return res
I also have a large pandas dataframe like this:
ColA
0 The quick fox jumps over the lazy
1 The quick fox over the lazy dog
2 The quick brown fox jumps over the lazy dog
....
n The brown fox jumps over the dog
I would like to apply this function to a large pandas dataframe, so I tried:
df['result'] = df[['ColA']].apply(foo, axis=1)
With the above approach it never finishes, so I tried this instead:
df['result'] = df['ColA'].apply(foo)
The problem is that the API is not receiving anything; furthermore, I got the following exception:
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Alternatively, I tried:
docs = df['ColA'].values.tolist()
list(map(foo, docs))
I am still having the same issue. Any idea how to pass a pandas column to the API efficiently?
UPDATE
After trying to use multiprocessing, I noticed that I get a JSONDecodeError: Expecting value: line 1 column 1 (char 0)
error. Therefore, I guess this situation is related to a caching issue. If it is, how can I solve this problem?
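A JSONDecodeError at char 0 usually means the response body was not JSON at all: an empty body, an HTML error page, or a rate-limit message from the API. Before blaming caching, it can help to parse defensively and keep the raw body around for inspection. A minimal sketch (safe_json is a hypothetical helper name, not part of requests):

```python
import json

def safe_json(text):
    """Parse text as JSON, returning None when the body is not valid JSON
    (an empty response, an HTML error page, a rate-limit message, ...)."""
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None

print(safe_json('{"a": 1}'))   # {'a': 1}
print(safe_json(''))           # None (this is exactly the "char 0" case)
print(safe_json('<html>'))     # None
```

Logging `r.status_code` and the first few hundred characters of `r.text` whenever this returns None will show what the API actually sent back.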
UPDATE 2
---------------------------------------------------------------------------
RemoteTraceback Traceback (most recent call last)
RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/local/Cellar/python3/3.5.2_2/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/usr/local/Cellar/python3/3.5.2_2/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/pool.py", line 44, in mapstar
return list(map(*args))
File "<ipython-input-3-7d058c7b9ac1>", line 9, in get_data
data = json.loads(r.text)
File "/usr/local/Cellar/python3/3.5.2_2/Frameworks/Python.framework/Versions/3.5/lib/python3.5/json/__init__.py", line 319, in loads
return _default_decoder.decode(s)
File "/usr/local/Cellar/python3/3.5.2_2/Frameworks/Python.framework/Versions/3.5/lib/python3.5/json/decoder.py", line 339, in decode
obj, end = self.raw_decode(s, idx=_w(s, 0).end())
File "/usr/local/Cellar/python3/3.5.2_2/Frameworks/Python.framework/Versions/3.5/lib/python3.5/json/decoder.py", line 357, in raw_decode
raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
"""
The above exception was the direct cause of the following exception:
JSONDecodeError Traceback (most recent call last)
<ipython-input-11-6bb417b3ed92> in <module>()
3 p = Pool(5)
4 # get data/response only for _unique_ strings (parameters)
----> 5 rslt = pd.Series(p.map(get_data, df2['sents'].unique().tolist()),index=df['sents'].unique())
6 # map responses back to DF (it'll take care of duplicates)
7 df['new'] = df2['ColA'].map(rslt)
/usr/local/Cellar/python3/3.5.2_2/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/pool.py in map(self, func, iterable, chunksize)
258 in a list that is returned.
259 '''
--> 260 return self._map_async(func, iterable, mapstar, chunksize).get()
261
262 def starmap(self, func, iterable, chunksize=None):
/usr/local/Cellar/python3/3.5.2_2/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/pool.py in get(self, timeout)
606 return self._value
607 else:
--> 608 raise self._value
609
610 def _set(self, i, obj):
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
Upvotes: 2
Views: 814
Reputation: 210982
Inspired by @GauthierFeuillen's answer, I wanted to adapt it to be more pandas-friendly:
import numpy as np
import pandas as pd
from multiprocessing import Pool
import requests

url = 'https://api.example.com/api'
df = pd.read_csv("data.csv")

def get_data(text, url=url):
    r = requests.get(url,
                     params={'key': '<YOUR KEY>',
                             'in': text,
                             'fj': 'm'})
    if r.status_code != requests.codes.ok:
        return np.nan
    return r.text

if __name__ == '__main__':
    p = Pool(5)
    # get data/response only for _unique_ strings (parameters)
    rslt = pd.Series(p.map(get_data, df['ColA'].unique().tolist()),
                     index=df['ColA'].unique())
    # map responses back to DF (it'll take care of duplicates)
    df['new'] = df['ColA'].map(rslt)
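The payoff of fetching only unique values can be seen without any network at all. A minimal sketch, where fake_get_data stands in for the real API call and the data is made up for illustration:

```python
import pandas as pd

# Toy data: 5 rows but only 3 distinct values.
df = pd.DataFrame({'ColA': ['a', 'b', 'a', 'c', 'b']})
calls = []

def fake_get_data(text):
    calls.append(text)       # record each "API call"
    return text.upper()      # pretend this is the API response

# Call the expensive function once per *unique* value...
uniq = df['ColA'].unique()
rslt = pd.Series([fake_get_data(t) for t in uniq], index=uniq)

# ...then map the responses back; duplicates are filled in for free.
df['new'] = df['ColA'].map(rslt)

print(len(calls))           # 3 calls, not 5
print(df['new'].tolist())   # ['A', 'B', 'A', 'C', 'B']
```

With real data containing many repeated sentences, this alone can cut the number of HTTP requests dramatically, independent of any parallelism.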
Upvotes: 2
Reputation: 184
This should fit your needs:
import pandas as pd
from multiprocessing import Pool
import requests

df = pd.read_csv("data.csv")

def getLink(link):
    return requests.get(link).text

if __name__ == '__main__':
    p = Pool(5)
    print(p.map(getLink, df["link"]))
Just adapt it as you need (here I only took the text from the URL). The real idea is to use the multiprocessing package to parallelize the work :)
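Since HTTP requests are I/O-bound, a thread pool is often a lighter-weight alternative to multiprocessing here: no pickling of arguments, and it also works inside notebooks where `multiprocessing.Pool` can misbehave. A sketch using the standard-library `concurrent.futures`, with a stand-in fetch function so it runs without a network (in real code it would be `requests.get(link).text`):

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(link):
    # Stand-in for requests.get(link).text, to keep the sketch self-contained.
    return link + '-body'

links = ['u1', 'u2', 'u3']

# map() preserves input order, just like multiprocessing.Pool.map.
with ThreadPoolExecutor(max_workers=5) as pool:
    results = list(pool.map(fetch, links))

print(results)  # ['u1-body', 'u2-body', 'u3-body']
```

Threads are a good default for network-bound work like this; processes only pay off when the per-item work is CPU-heavy.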
Upvotes: 2