renakre
renakre

Reputation: 8291

Using apply to insert a value extracted from one column (with json type) to another one

I have this dataset:

userid   sub_id    event
1        NaN       {'score':25, 'sub_id':5}
1        5         {'score':1}

When sub_id column is NaN, I want to extract this information from event column using the following code:

df['sub_id'] = df.apply(lambda row: 
                        row['event'].split('sub_id')[1] 
                        if pd.isnull(row['sub_id']) 
                        else row['sub_id']) 

However, I am receiving this error: KeyError: ('sub_id', u'occurred at index index')

I am trying to obtain this dataframe:

userid   sub_id    event
1        5         {'score':25, 'sub_id':5}
1        5         {'score':1}

Any idea for the error, or any suggestions for a different solution?

UPDATE

I need to extract the value that resides in a nested dict element:

event
{u'POST': {u'{"options_selected":{"Ideas":"0"},"criterion_feedback":{},"overall_feedback":"Feedback_text_goes_here_1"}': [u'']}, u'GET': {}}

I am using this code:

df['POST'] = df['event'].apply(pd.Series)['POST']

which creates the following column:

POST
{u'{"options_selected":{"Ideas":"0"},"criterion_feedback":{},"overall_feedback":"Feedback_text_goes_here_1"}': [u'']}

However, I need to get the overall_feedback value. Because of the formatting of the POST field, the following code does not work:

df['POST'].apply(pd.Series)['overall_feedback']

It throws this error KeyError: 'overall_feedback'

Any ideas?

Upvotes: 2

Views: 93

Answers (1)

jezrael
jezrael

Reputation: 863166

You can use combine_first or fillna:

print (type(df.loc[0, 'event']))
<class 'dict'>

df['sub_id'] = df['sub_id'].combine_first(df.event.apply(lambda x: x['score']))
#df['sub_id'] = df['sub_id'].fillna(df.event.apply(lambda x: x['score']))
print (df)
                       event  sub_id  userid
0  {'sub_id': 5, 'score': 5}     5.0       1
1               {'score': 1}     5.0       1

EDIT: If nested dict, faster solutio is use double DataFame constructor and slowier solution double apply with Series:

df = pd.DataFrame({'userid':[1,1],
                   'sub_id':[np.nan, 5],
                   'event':[{'post':{'score':25, 'sub_id':5}},{'post':{'score':1}} ]})

print (df)
                                  event  sub_id  userid
0  {'post': {'sub_id': 5, 'score': 25}}     NaN       1
1                {'post': {'score': 1}}     5.0       1

s = pd.DataFrame(pd.DataFrame(df['event'].values.tolist())['post'].values.tolist())['score']
print (s)
0    25
1     1
Name: score, dtype: int64
s = df['event'].apply(pd.Series)['post'].apply(pd.Series)['score']
print (s)
0    25.0
1     1.0
Name: score, dtype: float64

df['sub_id'] = df['sub_id'].combine_first(s)
print (df)
                                  event  sub_id  userid
0  {'post': {'sub_id': 5, 'score': 25}}    25.0       1
1                {'post': {'score': 1}}     5.0       1

EDIT1:

For converting to dict is possible use:

import ast, yaml

df = pd.DataFrame({'userid':[1,1],
                   'sub_id':[np.nan, 5],
                   'event':[{'post':{'score':25, 'sub_id':5}},{'post':{'score':1}} ]})

df.event = df.event.astype(str)
print (type(df.loc[0, 'event']))
<class 'str'>

df['event'] = df['event'].apply(ast.literal_eval)
#df['event'] = df['event'].apply(yaml.load)
print (df)
                                  event  sub_id  userid
0  {'post': {'sub_id': 5, 'score': 25}}     NaN       1
1                {'post': {'score': 1}}     5.0       1

print (type(df.loc[0, 'event']))
<class 'dict'>

EDIT2:

d = {u'{"options_selected":{"Ideas":"0"},"criterion_feedback":{},"overall_feedback":"Feedback_text_goes_here_1"}': [u'']}
d1 = {u'{"options_selected":{"Ideas":"2"},"criterion_feedback":{},"overall_feedback":"Feedback_text_goes_here_2"}': [u'']}

df = pd.DataFrame({'userid':[1,1],
                   'sub_id':[np.nan, 5],
                   'event':[d,d1]})

df['event'] = df['event'].astype(str).apply(yaml.load).apply(lambda x : yaml.load(list(x.keys())[0]))

print (type(df.event.iloc[0]))
<class 'dict'>

print (df.event.apply(pd.Series)['overall_feedback'])
0    Feedback_text_goes_here_1
1    Feedback_text_goes_here_2
Name: overall_feedback, dtype: object

Upvotes: 2

Related Questions