Josh Kidd
Josh Kidd

Reputation: 870

Parsing a column of JSON strings

I have a tab seperated flatfile, one column of which is JSON data stored as a string, e.g.

Col1        Col2                    Col3
1491109818  2017-08-02 00:00:09.250 {"type":"Tipper"}
1491110071  2017-08-02 00:00:19.283 {"type":"HGV"}
1491110798  2017-08-02 00:00:39.283 {"type":"Tipper"}
1491110798  2017-08-02 00:00:39.283 \N
...

What I want to do is load the table as a pandas dataframe, and for col3 change the data to a string with just the information from the type key. Where there is no JSON or a JSON without a type key I want to return None.

e.g.

Col1        Col2                    Col3
1491109818  2017-08-02 00:00:09.250 Tipper
1491110071  2017-08-02 00:00:19.283 HGV
1491110798  2017-08-02 00:00:39.283 Tipper
1491110798  2017-08-02 00:00:39.283 None
...

The only way I can think to do this is with iterrows, however this is very slow when dealing with large files.

for index, row in df.iterrows():
    try:
        df.loc[index, 'Col3'] = json.loads(row['Col3'])['type']
    except:
        df.loc[index, 'Col3'] = None

Any suggestions on a quicker approach?

Upvotes: 0

Views: 1323

Answers (1)

cs95
cs95

Reputation: 402553

Using np.vectorize and json.loads

import json

def foo(x):
    try:
        return json.loads(x)['type']
    except (ValueError, KeyError):
        return None

v = np.vectorize(foo)
df.Col3 = v(df.Col3)

Note that it is never recommended to use a bare except, as you can inadvertently catch and drop errors you didn't mean to.


df

         Col1                     Col2    Col3
0  1491109818  2017-08-02 00:00:09.250  Tipper
1  1491110071  2017-08-02 00:00:19.283     HGV
2  1491110798  2017-08-02 00:00:39.283  Tipper
3  1491110798  2017-08-02 00:00:39.283    None

Upvotes: 2

Related Questions