Reputation: 3036
I have a very large dataset in CSV format in which one column is a JSON string. I want to read this information into a flat Pandas data frame. How can I achieve this efficiently?
Input CSV:
col1,col2,col3,col4
1,Programming,"{""col3_1"":null,""col3_2"":""Java""}",11
2,Sport,"{""col3_1"":null,""col3_2"":""Soccer""}",22
3,Food,"{""col3_1"":null,""col3_2"":""Pizza""}",33
Expected DataFrame:
+---------------------------------------------------------------+
| col1 | col2 | col3_1 | col3_2 | col4 |
+---------------------------------------------------------------+
| 1 | Programming | None | Java | 11 |
| 2 | Sport | None | Soccer | 22 |
| 3 | Food | None | Pizza | 33 |
+---------------------------------------------------------------+
I can currently get the expected output using the following code. I just want to know if there is a more efficient way to achieve the same.
import json
import pandas
dataset = pandas.read_csv('/dataset.csv')
dataset['col3'] = dataset['col3'].apply(json.loads)
dataset['col3_1'] = dataset['col3'].apply(lambda row: row['col3_1'])
dataset['col3_2'] = dataset['col3'].apply(lambda row: row['col3_2'])
dataset = dataset.drop(columns=['col3'])
Upvotes: 2
Views: 2676
Reputation: 862581
Use for better performance DataFrame
contructor with DataFrame.pop
for extract column:
df1 = pd.DataFrame(df.pop('col3').apply(pd.io.json.loads).values.tolist(), index=df.index)
df = df.join(df1)
print (df)
col1 col2 col4 col3_1 col3_2
0 1 Programming 11 None Java
1 2 Sport 22 None Soccer
2 3 Food 33 None Pizza
Detail:
print (df.pop('col3').apply(pd.io.json.loads))
0 {'col3_1': None, 'col3_2': 'Java'}
1 {'col3_1': None, 'col3_2': 'Soccer'}
2 {'col3_1': None, 'col3_2': 'Pizza'}
Name: col3, dtype: object
print (pd.DataFrame(df.pop('col3').apply(pd.io.json.loads).values.tolist(), index=df.index))
col3_1 col3_2
0 None Java
1 None Soccer
2 None Pizza
Solutions are similar, but performance is different:
df = pd.concat([df] * 10000, ignore_index=True)
In [204]: %timeit df.join(pd.DataFrame(df['col3'].apply(pd.io.json.loads).values.tolist(), index=df.index))
10 loops, best of 3: 76.4 ms per loop
In [205]: %timeit df.join(df['col3'].apply(lambda x: pd.Series(json.loads(x))))
1 loop, best of 3: 11.3 s per loop
Upvotes: 4
Reputation: 210832
you can parse JSON in Pandas column using json.loads()
and convert it to Pandas columns using pd.Series()
:
In [85]: df.join(df.pop('col3').apply(lambda x: pd.Series(json.loads(x))))
Out[85]:
col1 col2 col4 col3_1 col3_2
0 1 Programming 11 None Java
1 2 Sport 22 None Soccer
2 3 Food 33 None Pizza
Upvotes: 6