I am making requests to an API that POS-tags text, as follows:
import requests

def pos(text):
    payload = {'key': 'thekey', 'of': 'json', 'ilang': 'ES',
               'txt': text,
               'tt': 'a',
               'uw': 'y', 'lang': 'es'}
    r = requests.get('http://api.meaningcloud.com/parser-2.0', params=payload, stream=True)
    return r.json()
At first, it raised a ValueError:
---------------------------------------------------------------------------
JSONDecodeError Traceback (most recent call last)
<ipython-input-19-ac09c6405340> in <module>()
1
----> 2 df['tags'] = df['tweets'].apply(transform)
3 df
/usr/local/lib/python3.5/site-packages/pandas/core/series.py in apply(self, func, convert_dtype, args, **kwds)
2292 else:
2293 values = self.asobject
-> 2294 mapped = lib.map_infer(values, f, convert=convert_dtype)
2295
2296 if len(mapped) and isinstance(mapped[0], Series):
pandas/src/inference.pyx in pandas.lib.map_infer (pandas/lib.c:66124)()
<ipython-input-18-707ac7b399b4> in transform(a_lis)
25
26 def transform(a_lis):
---> 27 analysis = pos(str(a_lis))
28 a_list = parse_tree(analysis['token_list'], [])
29 return a_list
<ipython-input-18-707ac7b399b4> in pos(text)
8
9 r = requests.get('http://api.meaningcloud.com/parser-2.0', params=payload, stream = True)
---> 10 return r.json()
11
12 def parse_tree(token, a_list):
/usr/local/lib/python3.5/site-packages/requests/models.py in json(self, **kwargs)
864 # used.
865 pass
--> 866 return complexjson.loads(self.text, **kwargs)
867
868 @property
/usr/local/lib/python3.5/site-packages/simplejson/__init__.py in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, use_decimal, **kw)
514 parse_constant is None and object_pairs_hook is None
515 and not use_decimal and not kw):
--> 516 return _default_decoder.decode(s)
517 if cls is None:
518 cls = JSONDecoder
/usr/local/lib/python3.5/site-packages/simplejson/decoder.py in decode(self, s, _w, _PY3)
368 if _PY3 and isinstance(s, binary_type):
369 s = s.decode(self.encoding)
--> 370 obj, end = self.raw_decode(s)
371 end = _w(s, end).end()
372 if end != len(s):
/usr/local/lib/python3.5/site-packages/simplejson/decoder.py in raw_decode(self, s, idx, _w, _PY3)
398 elif ord0 == 0xef and s[idx:idx + 3] == '\xef\xbb\xbf':
399 idx += 3
--> 400 return self.scan_once(s, idx=_w(s, idx).end())
JSONDecodeError: Expecting value: line 1 column 1 (char 0)
So I handled the exception and applied the function to a pandas DataFrame column with:
df = pd.read_csv('../data.csv')
df['tagged_text'] = df['tweets'].apply(transform)
However, for some instances (rows) I got None:
text                                               tagged_text
Siento que estoy en un cuarto oscuro y hay sil...  [(sentar, VI-S1PSABL-N4), (que, CSSN9), (estar...
Los mejores de @UEoficial Sebastián Jaime, Sey...  None
#ColoColoJuegaEnEl13 la primera y adentro mier...  None
Juguito heladoooo de melón: me siento se...        None
@sxfiacrespo @lunasoledadhern Hola Luna...         [(@sxfiacrespo @lunasoledadhern, NPUU-N-), (ho...
So my question is: why am I getting None for some texts (rows), and how can I correctly tag those None instances? Note that I ran some tests and there is no problem with the text itself, since for those None instances a JSON with all the tagged content is returned. For example, consider this function application.
Reputation: 1121186
This does next to nothing:
except ValueError:
    np.nan
That only references the np.nan object. If you want to return it, you need to do so explicitly:
except ValueError:
    return np.nan
otherwise the function just... ends, which means None is returned.
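The difference is easy to reproduce without the API. This is a minimal sketch, not your actual code: float() stands in for r.json() as the call that raises ValueError, float("nan") stands in for np.nan, and both function names are made up for illustration.

```python
def parse_without_return(raw):
    """Mimics the original handler: the except branch has no return."""
    try:
        return float(raw)
    except ValueError:
        float("nan")  # the expression is evaluated, then its value is discarded


def parse_with_return(raw):
    """Same logic, but the except branch returns the fallback explicitly."""
    try:
        return float(raw)
    except ValueError:
        return float("nan")


print(parse_without_return("not a number"))  # None
print(parse_with_return("not a number"))     # nan
```

The first function falls off the end after the except block, so Python returns None implicitly; the second hands the fallback value back to the caller.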
Other notes:
r = requests.get('http://api.meaningcloud.com/parser-2.0', data=payload, stream = True)
json_data = json.dumps(r.json())
data = yaml.load(json_data)
return data
is a really expensive way of spelling
r = requests.post('http://api.meaningcloud.com/parser-2.0', data=payload)
return r.json()
Loading JSON into Python, then producing JSON again, then using a YAML parser to turn the JSON back to Python is somewhat excessive. I also removed stream=True; that's only needed when you want to process the response data as a stream (which the response.json() method doesn't do).
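To see that the extra round trip buys nothing, here is a small sketch using only the standard library. The payload is made up (it is not a real API response), and json.loads stands in for the YAML parser as the final step, since the point is only that re-serializing and re-parsing yields the object you already had.

```python
import json

# Made-up stand-in for the decoded API response (what r.json() would return).
parsed = json.loads('{"token_list": [{"form": "hola"}]}')

# Serialize back to JSON text, then parse that text again.
round_tripped = json.loads(json.dumps(parsed))

print(round_tripped == parsed)  # True
```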
According to the API documentation, txt is supposed to be a single string. I wouldn't use str(a_lis) to produce that; if you have a list of strings, just join those into one long string with ' '.join(a_lis). However, I'm sure that pandas.Series.apply() passes individual values (e.g. strings) to your function, at which point there is no need to join anything at all (but your a_lis variable name is very confusing in that case).
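If you ever do have a list of strings, the difference between the two spellings looks like this (the word list is just an illustrative sample):

```python
words = ["Siento", "que", "estoy"]

as_repr = str(words)       # the list's repr, brackets and quotes included
as_text = " ".join(words)  # the words themselves, separated by spaces

print(as_repr)  # ['Siento', 'que', 'estoy']
print(as_text)  # Siento que estoy
```

Sending the first form would make the API tag the brackets, quotes and commas as part of the text.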
The API also specifies that it uses POST requests (I'm surprised they still accept GET at all). Using a POST request (requests.post()) will allow you to send much larger pieces of text for analysis. Use the data keyword. I've used the correct syntax in my last sample above.
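That you used GET is also the reason you get a ValueError: for longer texts the server rejects the over-long URL, and the error response is not JSON, so r.json() fails: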
>>> r = requests.get('http://api.meaningcloud.com/parser-2.0', params=payload)
>>> r.status_code
414
>>> r.reason
'Request-URI Too Long'
>>> r = requests.post('http://api.meaningcloud.com/parser-2.0', data=payload)
>>> r.status_code
200
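A rough way to see why only some rows fail: with GET, the whole payload ends up in the URL, so the URL grows with the text. The sketch below uses urllib.parse.urlencode as an approximation of the query string that requests builds from params; the helper name is made up, and the exact length at which this server starts answering 414 is not something I know.

```python
from urllib.parse import urlencode

BASE_URL = "http://api.meaningcloud.com/parser-2.0"


def get_url_length(text):
    """Approximate length of the GET URL for a given text (hypothetical helper)."""
    payload = {"key": "thekey", "of": "json", "ilang": "ES",
               "txt": text, "tt": "a", "uw": "y", "lang": "es"}
    return len(BASE_URL + "?" + urlencode(payload))


short_len = get_url_length("hola")
long_len = get_url_length("hola " * 2000)

print(short_len, long_len)
```

With POST and the data keyword, the same payload travels in the request body instead, so the URL stays short no matter how long the text is.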