CutePoison
CutePoison

Reputation: 5355

From column consisting of json-strings to DataFrame

I have a dataframe with a column data as below

data
----
'{"user":"[1,2]", "name":"[John,Doe]"}'
'{"user":"[3,4]", "name":"[Foo,Bar]"}'
'{"user":"[1,3]", "name":"[Baz,Bol]"}'

and I want a DataFrame like

user  | name
------+-----
1     | John
2     | Doe
3     | Foo
4     | Bar
1     | Baz
3     | Bol

I can do it by calling read_json on each row and append that to the previous read_json call e.g something like

df_end = pd.DataFrame(columns=["user","name"]) #init
for val in data:
    df_end = pd.concat(pd.read_json(val)) #concat it

but that clearly does not scale very well.

Is there a better way to accomplish it?

Upvotes: 1

Views: 34

Answers (2)

jezrael
jezrael

Reputation: 862581

You can try create dictionary of lists by defaultdict:

from collections import defaultdict

d = defaultdict(list)

for x in df['data']:
    for k, v in json.loads(x).items():
        d[k].extend(v.strip('[]').split(','))
        
df = pd.DataFrame(d)
print (df)  
  user  name
0    1  John
1    2   Doe
2    3   Foo
3    4   Bar
4    1   Baz
5    3   Bol

Upvotes: 0

Eric Truett
Eric Truett

Reputation: 3010

You have two options that are better than your current option:

(1) Create a list of dataframes and concatenate them at the end instead of concatenating after each iteration.

(2) Convert the column to a list of list of dict and then flatten the sequence and convert to a dataframe.

# option 1
df_end = pd.concat([pd.read_json(val) for val in data])

# option 2
# this gives you a nested list
# itertools.chain.from_iterable will effectively flatten it
import itertools 

data = [json.loads(val) for val in data]
df_end = pd.DataFrame(itertools.chain.from_iterable(data))

Upvotes: 2

Related Questions