Matts

Reputation: 1341

Convert pandas series of json objects to DataFrame

I'm trying to figure out how to convert a pandas series of JSON objects, taken from a slice of a DataFrame, into a new DataFrame with the keys as columns. I've tried all the similar solutions I could find, but nothing seems to work: it either returns the same series or raises errors. Some of the rows have different keys and the number of elements varies, but there's a limited amount of variation, so each key can have its own column. Is there a way to convert this type of series into a DataFrame?

1       {'length': '27 mm', 'style': 'Short', 'width': '9 mm'}
2       {'length': '51 mm', 'style': 'Long', 'width': '9 mm'}
5       {'length': '29 mm', 'style': 'Medium', 'width': '9 mm','id': '16684'}
6       {'width': '7 mm', 'length': '29 mm', 'style': 'Medium','mfg': '45-163'}
8       {'width': '8 mm', 'length': '33 mm', 'style': 'Medium'}

Upvotes: 1

Views: 416

Answers (2)

AdagioMolto

Reputation: 192

Assuming you have valid JSON (e.g. double quotes, valid syntax, etc.; see the discussion in Viacheslav Zhukov's answer) in your series, you can also use

import json
pd.DataFrame(
    my_json_series.apply(json.loads).to_list()
    , index=my_json_series.index  # optional if you want to keep the index
)

This avoids the for loop/list comprehension and is much faster. The steps are essentially:

  1. convert each JSON to dict
  2. concat all dicts in a list/make the series a list
  3. make a DataFrame from the list

Also, I don't get a proper MultiIndex with the proposed for loop.
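Applied end to end to data like the question's, the steps above look as follows (a minimal sketch; the series here is a stand-in I built with double-quoted, valid JSON strings, which the question's single-quoted data would first need):

```python
import json
import pandas as pd

# Stand-in series mirroring the question's data, but with valid double-quoted JSON
my_json_series = pd.Series({
    1: '{"length": "27 mm", "style": "Short", "width": "9 mm"}',
    2: '{"length": "51 mm", "style": "Long", "width": "9 mm"}',
    5: '{"length": "29 mm", "style": "Medium", "width": "9 mm", "id": "16684"}',
})

# 1. parse each JSON string to a dict, 2. collect them into a list,
# 3. build one DataFrame from the list in a single call
out = pd.DataFrame(
    my_json_series.apply(json.loads).to_list(),
    index=my_json_series.index,  # optional: keep the original index
)
print(out)
```

Rows missing a key (here, `id`) simply get NaN in that column.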

In [110]: %%timeit
     ...: my_out_df1 = pd.DataFrame(
     ...:     my_json_series.apply(json.loads).to_list()
     ...:     , index=my_json_series.index  # optional if you want to keep the index
     ...: )
362 ms ± 5.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [111]: %%timeit
     ...: my_out_df_2 = pd.concat(
     ...:     [pd.DataFrame(json.loads(s), index=[ind])
     ...:      for ind, s in my_json_series.items()],
     ...:     sort=False
     ...: )
14.9 s ± 310 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I should add that I tested with a completely different series than yours: mine had 4 keys in each JSON and about 10,000 elements in the series.

EDIT 2: To solve the problem with single quotes you could also try eval(my_single_quote_json) instead of json.loads. This even works if your strings mix quote styles, since eval reads them as Python code. If your JSONs contain nulls, you first need to define a variable null (e.g. null = None). Of course, using eval is never ideal, as it executes arbitrary code, but it may help.
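A minimal sketch of this eval workaround on single-quoted strings (here null is supplied to eval via an explicit globals dict, which is more reliable inside a comprehension than a module-level variable; only do this with strings you trust, since eval runs them as code):

```python
import pandas as pd

# Single-quoted strings, as in the question; one row contains a bare null
s = pd.Series({
    1: "{'length': '27 mm', 'style': 'Short', 'width': '9 mm'}",
    2: "{'length': '51 mm', 'style': null}",
})

# eval each string as a Python dict literal; the globals dict maps null -> None
parsed = pd.DataFrame(
    [eval(x, {"null": None}) for x in s],
    index=s.index,
)
print(parsed)
```

For dict/list literals without nulls, the stricter stdlib ast.literal_eval would be a safer drop-in for eval here.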

Upvotes: 0

Viacheslav Zhukov

Reputation: 1350

Well, I would suggest:

  1. convert every JSON to a dict
  2. convert every dict to a DataFrame
  3. concat all the DataFrames, preserving indices

The data is in series in this example. Note that the str.replace call also turns ' into " so json.loads can parse the strings:

import json
pd.concat(
    [pd.DataFrame(json.loads(s), index=[ind])
     for ind, s in series.str.replace('\'', '"').items()],
    sort=False
)
    length  style   width   id      mfg
1   27 mm   Short   9 mm    NaN     NaN
2   51 mm   Long    9 mm    NaN     NaN
5   29 mm   Medium  9 mm    16684   NaN
6   29 mm   Medium  7 mm    NaN     45-163
8   33 mm   Medium  8 mm    NaN     NaN

Upvotes: 2
