Qiaoyi Li

Reputation: 23

What is the difference between `json.loads()` and `.apply(json.loads)`?

I am quite new to coding, and I am working with the TMDB_5000 dataset from Kaggle.

I ran into a problem when trying to handle JSON-formatted data like this:

[{"cast_id": 242, "character": "Jake Sully", "credit_id": "5602a8a7c3a3685532001c9a", "gender": 2, "id": 65731, "name": "Sam Worthington", "order": 0}, {"cast_id": 3, "character": "Neytiri", "credit_i...}]

I am trying to use json.loads() to parse the data with credits['cast'] = json.loads(credits['cast']), but it gives me an error like this:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
 in ()
----> 1 credits['cast'] = json.loads(credits['cast'])

/anaconda3/lib/python3.6/json/__init__.py in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
    346         if not isinstance(s, (bytes, bytearray)):
    347             raise TypeError('the JSON object must be str, bytes or bytearray, '
--> 348                             'not {!r}'.format(s.__class__.__name__))
    349         s = s.decode(detect_encoding(s), 'surrogatepass')
    350

TypeError: the JSON object must be str, bytes or bytearray, not 'Series'

However, the code credits['cast'] = credits['cast'].apply(json.loads) works. I am confused, because I thought there was no difference between these two lines of code.

Can anyone explain that to me?

Upvotes: 0

Views: 2650

Answers (3)

Karn Kumar

Reputation: 8826

A detailed explanation has already been provided, but I would like to add that if you are using pandas to read and process the data, you can build a DataFrame from the records directly:

import pandas as pd
d_list = [{"cast_id": 242, "character": "Jake Sully", "credit_id": "5602a8a7c3a3685532001c9a", "gender": 2, "id": 65731, "name": "Sam Worthington", "order": 0}, {"cast_id": 3, "character": "Neytiri"}]

Create a DataFrame using DataFrame.from_dict:

df = pd.DataFrame.from_dict(d_list)
print(df)

   cast_id   character                 credit_id  gender       id             name  order
0      242  Jake Sully  5602a8a7c3a3685532001c9a     2.0  65731.0  Sam Worthington    0.0
1        3     Neytiri                       NaN     NaN      NaN              NaN    NaN

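Another approach suited to this purpose is pd.read_json with orient='records'. Note that read_json reads JSON text (a string, path, or file-like object), not an already-decoded Python list:

import pandas as pd
from io import StringIO

# read_json parses JSON text, so the records are given as a string and wrapped in StringIO
json_str = '[{"cast_id": 242, "character": "Jake Sully", "credit_id": "5602a8a7c3a3685532001c9a", "gender": 2, "id": 65731, "name": "Sam Worthington", "order": 0}, {"cast_id": 3, "character": "Neytiri"}]'
df = pd.read_json(StringIO(json_str), orient='records')
print(df)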

Upvotes: 0

Blckknght

Reputation: 104877

The issue is that your credits variable is a pandas DataFrame (and so credits['cast'] is a Series). The json.loads function doesn't know how to deal with data types from pandas, so you get an error when you do json.loads(credits['cast']).

The Series type, however, has an apply method that accepts a function to be called on each value it contains. That's why credits['cast'].apply(json.loads) works: it passes json.loads as the argument to apply, which then calls it once per string.
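To make the contrast concrete, here is a minimal sketch. The small `credits` DataFrame below is a hypothetical stand-in for the real TMDB data, with two shortened JSON strings in the `cast` column; the names `whole_series_error` and `parsed` are only for illustration.

```python
import json
import pandas as pd

# Hypothetical stand-in for the real `credits` DataFrame:
# each cell of the 'cast' column holds a JSON string.
credits = pd.DataFrame({
    "cast": [
        '[{"cast_id": 242, "character": "Jake Sully"}]',
        '[{"cast_id": 3, "character": "Neytiri"}]',
    ]
})

# Passing the whole Series fails, because json.loads only accepts
# str, bytes or bytearray.
try:
    json.loads(credits["cast"])
    whole_series_error = None
except TypeError as exc:
    whole_series_error = exc

# Series.apply calls json.loads once per element, and each element is a str.
parsed = credits["cast"].apply(json.loads)

print(type(whole_series_error).__name__)
print(parsed[0][0]["character"])
```

After the call to apply, each element of `parsed` is a Python list of dicts rather than a string.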

Upvotes: 2

DYZ

Reputation: 57145

The following code:

credits['cast'] = credits['cast'].apply(json.loads)

applies the function json.loads to each element of credits['cast'] (each element being a string). The result is a Series of decoded objects.

The following code:

credits['cast'] = json.loads(credits['cast'])

attempts to apply the same function to the Series credits['cast'], but the function cannot be applied to a Series.
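The same distinction can be seen without pandas. The sketch below is only an analogy, assuming a plain list of strings (the hypothetical `rows`) in place of the Series: calling the function per element corresponds roughly to what apply does, while calling it on the container corresponds to the failing line.

```python
import json

# Hypothetical rows standing in for the strings stored in credits['cast'].
rows = [
    '[{"cast_id": 242, "character": "Jake Sully"}]',
    '[{"cast_id": 3, "character": "Neytiri"}]',
]

# Roughly what .apply(json.loads) does: one call per element.
decoded = [json.loads(row) for row in rows]

# Roughly what json.loads(credits['cast']) does: one call on the container,
# which is not a str, bytes or bytearray, so a TypeError is raised.
try:
    json.loads(rows)
    container_error = ""
except TypeError as exc:
    container_error = str(exc)

print(decoded[1][0]["character"])
print(container_error)
```

This is not how pandas implements apply internally; it only illustrates per-element versus whole-container calls.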

Upvotes: 0
