Reputation: 870
I have a tab seperated flatfile, one column of which is JSON data stored as a string, e.g.
Col1 Col2 Col3
1491109818 2017-08-02 00:00:09.250 {"type":"Tipper"}
1491110071 2017-08-02 00:00:19.283 {"type":"HGV"}
1491110798 2017-08-02 00:00:39.283 {"type":"Tipper"}
1491110798 2017-08-02 00:00:39.283 \N
...
What I want to do is load the table as a pandas dataframe, and for col3 change the data to a string with just the information from the type key. Where there is no JSON or a JSON without a type key I want to return None.
e.g.
Col1 Col2 Col3
1491109818 2017-08-02 00:00:09.250 Tipper
1491110071 2017-08-02 00:00:19.283 HGV
1491110798 2017-08-02 00:00:39.283 Tipper
1491110798 2017-08-02 00:00:39.283 None
...
The only way I can think to do this is with iterrows, however this is very slow when dealing with large files.
for index, row in df.iterrows():
try:
df.loc[index, 'Col3'] = json.loads(row['Col3'])['type']
except:
df.loc[index, 'Col3'] = None
Any suggestions on a quicker approach?
Upvotes: 0
Views: 1323
Reputation: 402553
Using np.vectorize
and json.loads
import json
def foo(x):
try:
return json.loads(x)['type']
except (ValueError, KeyError):
return None
v = np.vectorize(foo)
df.Col3 = v(df.Col3)
Note that it is never recommended to use a bare except
, as you can inadvertently catch and drop errors you didn't mean to.
df
Col1 Col2 Col3
0 1491109818 2017-08-02 00:00:09.250 Tipper
1 1491110071 2017-08-02 00:00:19.283 HGV
2 1491110798 2017-08-02 00:00:39.283 Tipper
3 1491110798 2017-08-02 00:00:39.283 None
Upvotes: 2