Reputation: 5355
I have a dataframe with a column data
as below
data
----
'{"user":"[1,2]", "name":"[John,Doe]"}'
'{"user":"[3,4]", "name":"[Foo,Bar]"}'
'{"user":"[1,3]", "name":"[Baz,Bol]"}'
and I want a DataFrame like
user | name
------+-----
1 | John
2 | Doe
3 | Foo
4 | Bar
1 | Baz
3 | Bol
I can do it by calling read_json
on each row and append that to the previous read_json
call e.g something like
df_end = pd.DataFrame(columns=["user","name"]) #init
for val in data:
df_end = pd.concat(pd.read_json(val)) #concat it
but that clearly does not scale very well.
Is there a better way to accomplish it?
Upvotes: 1
Views: 34
Reputation: 862581
You can try create dictionary of lists by defaultdict
:
from collections import defaultdict
d = defaultdict(list)
for x in df['data']:
for k, v in json.loads(x).items():
d[k].extend(v.strip('[]').split(','))
df = pd.DataFrame(d)
print (df)
user name
0 1 John
1 2 Doe
2 3 Foo
3 4 Bar
4 1 Baz
5 3 Bol
Upvotes: 0
Reputation: 3010
You have two options that are better than your current option:
(1) Create a list of dataframes and concatenate them at the end instead of concatenating after each iteration.
(2) Convert the column to a list of list of dict and then flatten the sequence and convert to a dataframe.
# option 1
df_end = pd.concat([pd.read_json(val) for val in data])
# option 2
# this gives you a nested list
# itertools.chain.from_iterable will effectively flatten it
import itertools
data = [json.loads(val) for val in data]
df_end = pd.DataFrame(itertools.chain.from_iterable(data))
Upvotes: 2