Reputation: 2714
I have the following Pandas Dataframe (named dx
):
_id user_type
0 {'$oid': '5b9058462f38434ab0d85cd3'} 1
1 {'$oid': '5b9058462f38434ab0d85ce9'} 1
2 {'$oid': '5b9058462f38434ab0d85ced'} 1
3 {'$oid': '5b9058462f38434ab0d85cee'} 1
4 {'$oid': '5b9058462f38434ab0d85cef'} 1
It came from a big JSON (more or less 500.000 rows and more columns then this example), as you can see the _id
contain a nested JSON.
For this example, my target is to get a new column named oid
with the oid code:
_id user_type oid
0 {'$oid': '5b9058462f38434ab0d85cd3'} 1 5b9058462f38434ab0d85cd3
1 {'$oid': '5b9058462f38434ab0d85ce9'} 1 5b9058462f38434ab0d85ce9
2 {'$oid': '5b9058462f38434ab0d85ced'} 1 5b9058462f38434ab0d85ced
3 {'$oid': '5b9058462f38434ab0d85cee'} 1 5b9058462f38434ab0d85cee
4 {'$oid': '5b9058462f38434ab0d85cef'} 1 5b9058462f38434ab0d85cef
I achieved the results using the following snippet:
dx['oid']=None
for i in dx.index:
dx['oid'][i]=dx.at[i,'_id']['$oid']
This give me what I am looking for but it is very very slow. It takes 3-4 minutes to populate the column of the example, which has only 5 rows!
How can I optimize the creation of a new column based on other columns with JSON values?
I can't use regex because nested JSON is core complex then the one in the example.
Upvotes: 2
Views: 2996
Reputation: 164773
You can use operator.itemgetter
. No vectorised solution is possible since your input series has object
dtype.
from operator import itemgetter
field_name = '$oid'
df[field_name] = list(map(itemgetter(field_name), df['_id']))
More generalised solutions are available, see Splitting dictionary/list inside a Pandas Column into Separate Columns.
Upvotes: 2
Reputation: 323326
This is object
column type which contain single dict
within the cell so
df['$oid']=df['_id'].map(lambda x : x['$oid'])
Or
s=df['_id'].apply(pd.Series)
s
#df=pd.concat([df,s],axis=1)
Out[601]:
$oid
0 5b9058462f38434ab0d85cd3
1 5b9058462f38434ab0d85ce9
2 5b9058462f38434ab0d85ced
3 5b9058462f38434ab0d85cee
4 5b9058462f38434ab0d85cef
Upvotes: 3