Reputation: 1341
I'm trying to figure out how to convert a pandas series of json objects from a slice of a DataFrame to a new DataFrame with the keys as columns. I've tried all similar solutions, but nothing seems to work. It either returns the same series, or generates errors. Some of the lines have different keys, they're not all the same, and the number of elements varies. There's a limited amount of variation, so each key can have it's own column. Is there a way to convert this type of series into a DataFrame?
1 {'length': '27 mm', 'style': 'Short', 'width': '9 mm'}
2 {'length': '51 mm', 'style': 'Long', 'width': '9 mm'}
5 {'length': '29 mm', 'style': 'Medium', 'width': '9 mm','id': '16684'}
6 {'width': '7 mm', 'length': '29 mm', 'style': 'Medium','mfg': '45-163'}
8 {'width': '8 mm', 'length': '33 mm', 'style': 'Medium'}
Upvotes: 1
Views: 416
Reputation: 192
Assuming you have valid JSONs (e. g., double quotes, valid syntax etc., see discussion in Viacheslav Zhukov's answer) in your series, you can also use
import json
pd.DataFrame(
my_json_series.apply(json.loads).to_list()
, index=my_json_series.index # optional if you want to keep the index
)
avoiding the for loop/list comprehension and being much faster.
These steps are rather
DataFrame
from the listAlso, I don't get a proper MultiIndex
with the proposed for loop.
In [110]: %%timeit
...: my_out_df1 = pd.DataFrame(
...: my_json_series.apply(json.loads).to_list()
...: , index=my_json_series.index # optional if you want to keep the index
...: )
362 ms ± 5.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [111]: %%timeit
...: my_out_df_2 = pd.concat(
...: [pd.DataFrame(json.loads(s), index=[ind])
...: for ind, s in my_json_series.items()],
...: sort=False
...: )
14.9 s ± 310 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
I have to add that I tested with a completely different series than yours, mine had 4 elements in each json and about 10,000 elements in the series.
EDIT 2: To solve the problem with single quotes you could also try eval(my_single_quote_json)
instead of json.loads
. This even works if your JSON has varying quotes as it would read your string as Python code. If your JSONs contain null
s, you first need to define a variable null = 'some value or None'
. Of course, using eval
is never optimal, but it may help.
Upvotes: 0
Reputation: 1350
Well, I would suggest
Data is in series
in this example. Also It replaces '
with "
:
import json
pd.concat([pd.DataFrame(json.loads(s), index=[ind]) for ind, s in series.str.replace('\'', '"').items()], sort=False)
length style width id mfg
1 27 mm Short 9 mm NaN NaN
2 51 mm Long 9 mm NaN NaN
5 29 mm Medium 9 mm 16684 NaN
6 29 mm Medium 7 mm NaN 45-163
8 33 mm Medium 8 mm NaN NaN
Upvotes: 2