john doe

Reputation: 2253

Problems while applying an API call to a large dataframe?

With requests I am calling an API as follows:

import json
import requests

def foo(text):
    payload = {'key': '', 'in': text, 'fj': 'm'}
    r = requests.get('https://api.example.com/api', params=payload)
    res = json.loads(r.text)
    return res

I also have a large pandas dataframe like this:

    ColA
0   The quick  fox jumps over the lazy 
1   The quick  fox  over the lazy dog
2   The quick brown fox jumps over the lazy dog
....

n   The  brown fox jumps over the  dog

I would like to apply this function to the dataframe, so I tried:

df['result'] = df[['ColA']].apply(foo, axis=1)

With the above approach it never finishes. Thus, I tried this:

df['result'] = df['ColA'].apply(foo)

The problem is that the API does not receive anything; furthermore, I get the following exception:

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Alternatively, I tried:

docs = df['ColA'].values.tolist()
list(map(foo, docs))

I still have the same issue. Any idea how to pass a pandas column to the API efficiently?
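To narrow this down, the raw response can be inspected before parsing it: a JSONDecodeError at char 0 usually means the body is empty or not JSON at all (e.g. an HTML error page or a rate-limit message). A minimal diagnostic sketch, reusing the payload from above:

import requests

payload = {'key': '', 'in': 'The quick brown fox jumps over the lazy dog', 'fj': 'm'}
r = requests.get('https://api.example.com/api', params=payload)

# if this is not 200, or the body is not JSON, json.loads will fail at char 0
print(r.status_code)
print(r.text[:200])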

UPDATE

After trying multiprocessing, I noticed that I still get a JSONDecodeError: Expecting value: line 1 column 1 (char 0) error. Therefore, I guess this situation is related to a caching issue. If it is, how can I solve it?
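Regardless of the cause, a guarded variant of the request function (a hypothetical foo_safe, not part of the original code) keeps one bad response from aborting the whole run and logs what the API actually returned:

import json
import requests

def foo_safe(text):
    payload = {'key': '', 'in': text, 'fj': 'm'}
    r = requests.get('https://api.example.com/api', params=payload)
    try:
        return json.loads(r.text)
    except json.JSONDecodeError:
        # log the offending response instead of crashing the pool worker
        print(r.status_code, repr(r.text[:100]))
        return None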

UPDATE 2

---------------------------------------------------------------------------
RemoteTraceback                           Traceback (most recent call last)
RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/local/Cellar/python3/3.5.2_2/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/Cellar/python3/3.5.2_2/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "<ipython-input-3-7d058c7b9ac1>", line 9, in get_data
    data = json.loads(r.text)
  File "/usr/local/Cellar/python3/3.5.2_2/Frameworks/Python.framework/Versions/3.5/lib/python3.5/json/__init__.py", line 319, in loads
    return _default_decoder.decode(s)
  File "/usr/local/Cellar/python3/3.5.2_2/Frameworks/Python.framework/Versions/3.5/lib/python3.5/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/local/Cellar/python3/3.5.2_2/Frameworks/Python.framework/Versions/3.5/lib/python3.5/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)
"""

The above exception was the direct cause of the following exception:

JSONDecodeError                           Traceback (most recent call last)
<ipython-input-11-6bb417b3ed92> in <module>()
      3 p = Pool(5)
      4 # get data/response only for _unique_ strings (parameters)
----> 5 rslt = pd.Series(p.map(get_data, df2['sents'].unique().tolist()),index=df['sents'].unique())
      6 # map responses back to DF (it'll take care of duplicates)
      7 df['new'] = df2['ColA'].map(rslt)

/usr/local/Cellar/python3/3.5.2_2/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/pool.py in map(self, func, iterable, chunksize)
    258         in a list that is returned.
    259         '''
--> 260         return self._map_async(func, iterable, mapstar, chunksize).get()
    261 
    262     def starmap(self, func, iterable, chunksize=None):

/usr/local/Cellar/python3/3.5.2_2/Frameworks/Python.framework/Versions/3.5/lib/python3.5/multiprocessing/pool.py in get(self, timeout)
    606             return self._value
    607         else:
--> 608             raise self._value
    609 
    610     def _set(self, i, obj):

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Upvotes: 2

Views: 814

Answers (2)

MaxU - stand with Ukraine

Reputation: 210982

Inspired by @GauthierFeuillen's answer, I wanted to adapt it to be more pandas-friendly:

import pandas as pd
import numpy as np
from multiprocessing import Pool
import requests

url = 'https://api.example.com/api'

df = pd.read_csv("data.csv")

def get_data(text, url=url):
    r = requests.get(url,
                     params={'key': '<YOUR KEY>',
                             'in': text,
                             'fj': 'm'})
    if r.status_code != requests.codes.ok:
        return np.nan
    return r.text

if __name__ == '__main__':
    p = Pool(5)
    # get data/response only for _unique_ strings (parameters)
    rslt = pd.Series(p.map(get_data, df['ColA'].unique().tolist()),
                     index=df['ColA'].unique())
    # map responses back to DF (it'll take care of duplicates)
    df['new'] = df['ColA'].map(rslt)
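Note that df['new'] holds raw JSON strings (or NaN for failed requests). Assuming the responses are valid JSON, they can be decoded afterwards:

import json

# decode the raw JSON strings; rows whose request failed stay NaN
df['parsed'] = df['new'].dropna().apply(json.loads)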

Upvotes: 2

Gauthier Feuillen

Reputation: 184

This should fit your needs:

import pandas as pd
from multiprocessing import Pool
import requests

df = pd.read_csv("data.csv")

def getLink(link):
    return requests.get(link).text

if __name__ == '__main__':
    # fetch all links in parallel across 5 worker processes
    p = Pool(5)
    print(p.map(getLink, df["link"]))

Just adapt it as you need (here I only took the text from the URL). The key idea is to use the multiprocessing package to parallelize the work :)
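Since these calls are network I/O rather than CPU work, a thread pool is an equally valid choice. A sketch of the same pattern using multiprocessing.dummy (thread-based, same map interface), under the same data.csv assumption:

import pandas as pd
import requests
from multiprocessing.dummy import Pool as ThreadPool  # threads, not processes

df = pd.read_csv("data.csv")

def getLink(link):
    return requests.get(link).text

# threads avoid pickling and process start-up costs for I/O-bound requests
with ThreadPool(5) as p:
    print(p.map(getLink, df["link"]))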

Upvotes: 2
