Reputation: 23
I am quite new to coding, and I am trying to work with the TMDB_5000 dataset from Kaggle.
I ran into a problem when trying to deal with JSON-formatted data like this:
[{"cast_id": 242, "character": "Jake Sully", "credit_id": "5602a8a7c3a3685532001c9a", "gender": 2, "id": 65731, "name": "Sam Worthington", "order": 0}, {"cast_id": 3, "character": "Neytiri", "credit_i...}]
I am trying to use json.loads() to parse the data; the code is credits['cast'] = json.loads(credits['cast']). But it gives me an error like this:
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
in ()
----> 1 credits['cast'] = json.loads(credits['cast'])

/anaconda3/lib/python3.6/json/__init__.py in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    346         if not isinstance(s, (bytes, bytearray)):
    347             raise TypeError('the JSON object must be str, bytes or bytearray, '
--> 348                             'not {!r}'.format(s.__class__.__name__))
    349         s = s.decode(detect_encoding(s), 'surrogatepass')
    350

TypeError: the JSON object must be str, bytes or bytearray, not 'Series'
However, the code credits['cast'] = credits['cast'].apply(json.loads) works. So I am very confused, because I thought there was no difference between these two lines of code.
Can anyone explain this to me?
Upvotes: 0
Views: 2650
Reputation: 8826
A detailed explanation has already been provided, but I would like to add that if you are using pandas to read and process the data, you can use:
import pandas as pd
d_list = [{"cast_id": 242, "character": "Jake Sully", "credit_id": "5602a8a7c3a3685532001c9a", "gender": 2, "id": 65731, "name": "Sam Worthington", "order": 0}, {"cast_id": 3, "character": "Neytiri"}]
Create a DataFrame using DataFrame.from_dict:
df = pd.DataFrame.from_dict(d_list)
print(df)
   cast_id   character                 credit_id  gender       id             name  order
0      242  Jake Sully  5602a8a7c3a3685532001c9a     2.0  65731.0  Sam Worthington    0.0
1        3     Neytiri                       NaN     NaN      NaN              NaN    NaN
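Another approach that suits this purpose is pd.read_json with orient='records'. Note that read_json expects JSON text (a string, path, or file-like object), not a Python list, so the list is serialized first: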
import json
from io import StringIO
import pandas as pd
d_list = [{"cast_id": 242, "character": "Jake Sully", "credit_id": "5602a8a7c3a3685532001c9a", "gender": 2, "id": 65731, "name": "Sam Worthington", "order": 0}, {"cast_id": 3, "character": "Neytiri"}]
# read_json needs JSON text or a file-like object, so serialize the list first
df = pd.read_json(StringIO(json.dumps(d_list)), orient='records')
print(df)
Upvotes: 0
Reputation: 104877
The issue is that your credits variable is a pandas DataFrame (and so credits['cast'] is a Series). The json.loads function doesn't know how to deal with pandas data types; it only accepts a single str, bytes, or bytearray value. That is why you get an error when you do json.loads(credits['cast']).
The Series type, however, has an apply method that accepts a function to be called on each value it contains. That's why credits['cast'].apply(json.loads) works: it passes json.loads as the argument to apply, which then calls it on each string in the column.
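To make the difference concrete, here is a minimal sketch. The small credits DataFrame below is a hypothetical stand-in for the Kaggle file (its contents are made up for illustration, not taken from the real dataset); it only assumes that each cell in the 'cast' column is a JSON string.

```python
import json

import pandas as pd

# Hypothetical stand-in for the 'cast' column: each cell is one JSON string.
credits = pd.DataFrame({
    "cast": [
        '[{"cast_id": 242, "character": "Jake Sully"}]',
        '[{"cast_id": 3, "character": "Neytiri"}]',
    ]
})

# Passing the whole Series fails: json.loads wants a single str/bytes value.
try:
    json.loads(credits["cast"])
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# apply calls json.loads once per cell, so each string is decoded separately.
decoded = credits["cast"].apply(json.loads)
first_character = decoded[0][0]["character"]
```

After this runs, error_message should hold the same kind of TypeError text as in the question, and each element of decoded should be a list of dicts rather than a string.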
Upvotes: 2
Reputation: 57145
The following code:
credits['cast'] = credits['cast'].apply(json.loads)
applies the function json.loads to each row of credits['cast'] (each row being a string). The result is a Series of decoded objects.
The following code:
credits['cast'] = json.loads(credits['cast'])
attempts to apply the same function to the Series credits['cast'] as a whole, but json.loads cannot be applied to a Series.
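The same distinction can be seen without pandas. As a rough analogy (a plain list is used here as a stand-in for the Series, and the rows are hypothetical sample strings), calling the function per element works, while calling it on the container does not:

```python
import json

# Hypothetical rows: each element is one JSON string, like one cell of the column.
rows = [
    '[{"character": "Jake Sully", "order": 0}]',
    '[{"character": "Neytiri", "order": 1}]',
]

# Roughly what Series.apply(json.loads) does: one call per element.
decoded_rows = [json.loads(row) for row in rows]

# Roughly what json.loads(series) does: one call on the container itself.
try:
    json.loads(rows)
    container_error = None
except TypeError as exc:
    container_error = str(exc)
```

Here decoded_rows ends up as a list of parsed objects, while container_error records the TypeError raised because a list, like a Series, is not a str, bytes, or bytearray.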
Upvotes: 0