Reputation: 3086
Consider a snippet of a JSON file:
{
"participant_id": 37,
"response_date": "2016-05-19T07:19:32.620Z",
"data": {
"summary": 8,
"q6": [
"1",
"2"
],
"q1": 0,
"q2": 1,
"q3": 1,
"q4": 2,
"q5": 2
}
},
{
"participant_id": 37,
"response_date": "2016-05-26T07:14:24.7130Z",
"data": {
"summary": 8,
"q6": [
"1",
"2",
"4"
],
"q1": 0,
"q2": 1,
"q3": 1,
"q4": 2,
"q5": 2
}
}
which will produce a Pandas data frame:
0 q1 q2 q3 q4 q5 q6 summary participant_id response_date
672 NaN 0.0 1.0 1.0 2.0 2.0 [1, 2] 8.0 37 2016-05-19 07:19:32.620
711 NaN 0.0 1.0 1.0 2.0 2.0 [1, 2, 4] 7.0 37 2016-05-26 07:14:24.713
How can I expand the nested q6 column into a 'wider' format? The q6 attribute may contain up to 4 possible values, so ideally the result should be:
0 q1 q2 q3 q4 q5 q6 q7 q8 q9 summary participant_id response_date
672 NaN 0.0 1.0 1.0 2.0 2.0 1.0 1.0 0.0 0.0 8.0 37 2016-05-19 07:19:32.620
711 NaN 0.0 1.0 1.0 2.0 2.0 1.0 1.0 0.0 1.0 7.0 37 2016-05-26 07:14:24.713
In other words, the numbers in the square brackets encode the positions of the 1s in a 4-element array.
Is there a simple Pandasian solution?
EDIT
Some entries are mistakenly reversed or randomly recorded (1st and 3rd rows):
0 q1 q2 q3 q4 q5 q6 summary participant_id response_date
672 NaN 0.0 1.0 1.0 2.0 2.0 [1, 2] 8.0 37 2016-05-19 07:19:32.620
711 NaN 0.0 1.0 1.0 2.0 2.0 [1] 7.0 37 2016-05-20 07:14:24.713
740 NaN 0.0 1.0 1.0 2.0 2.0 [2, 1] 8.0 37 2016-05-21 07:10:17.251
774 NaN 0.0 1.0 1.0 1.0 3.0 [1, 2] 8.0 37 2016-05-22 08:28:14.579
809 NaN 0.0 1.0 1.0 1.0 3.0 [1, 2] 8.0 37 2016-05-23 07:30:27.259
They should be sorted before any further manipulations are performed.
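A minimal sketch of that sorting step, assuming q6 holds lists of numeric strings as in the JSON above. The small frame here is made up for illustration and is not the real data:

```python
import pandas as pd

# Hypothetical frame mirroring the q6 column above; values are illustrative.
df = pd.DataFrame({"q6": [["1", "2"], ["1"], ["2", "1"]]})

# Sort each list numerically so that ["2", "1"] becomes ["1", "2"].
df["q6"] = df["q6"].apply(lambda vals: sorted(vals, key=int))

print(df["q6"].tolist())
```

Sorting with key=int rather than plain string order is a design choice that only matters once values can reach two digits.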
Upvotes: 3
Views: 543
Reputation: 176
>>> df1
participant_id q1 q2 q3 q4 q5 q6 response_date summary
0 37 0 1 1 2 2 [1, 2] 2016-05-19T07:19:32.620Z 8
1 37 0 1 1 2 2 [1, 2, 4] 2016-05-26T07:14:24.7130Z 8
>>> def widen(oldQ6):
... l = [0.0, 0.0, 0.0, 0.0]
... for i in oldQ6:
... l[int(i)-1] = 1.0
... return l
...
>>> df1['q6'], df1['q7'], df1['q8'], df1['q9'] = zip(*df1['q6'].map(widen))
>>> df1
participant_id q1 q2 q3 q4 q5 q6 response_date summary q7 q8 q9
0 37 0 1 1 2 2 1 2016-05-19T07:19:32.620Z 8 1 0 0
1 37 0 1 1 2 2 1 2016-05-26T07:14:24.7130Z 8 1 0 1
Upvotes: 1
Reputation: 862741
I think it is not so easy. The steps are:
DataFrame + get_dummies for the new df
reindex for adding missing values + rename for the columns
concat to the original, with the (q6) column removed
startswith + reindex_axis for the new order of columns

df1 = pd.get_dummies(pd.DataFrame(df['q6'].values.tolist()), prefix_sep='', prefix='')
df1.columns = df1.columns.astype(int)
df1 = df1.reindex(columns=range(1, 5), fill_value=0).rename(columns=lambda x: 'q{}'.format(x + 5))
print (df1)
q6 q7 q8 q9
0 1 1 0 0
1 1 1 0 1
df = pd.concat([df.drop('q6', axis=1), df1], axis=1)
mask = df.columns.str.startswith('q', na=False)
cols1 = df.columns[mask].tolist()
cols2 = df.columns[~mask].tolist()
cols = cols2[:1] + cols1 + cols2[1:]
df = df.reindex(columns=cols)  # reindex_axis was removed in pandas 1.0; reindex(columns=...) is the equivalent
print (df)
0 q1 q2 q3 q4 q5 q6 q7 q8 q9 summary participant_id \
0 NaN 0 1 1 2 2 1 1 0 0 8 37
1 NaN 0 1 1 2 2 1 1 0 1 8 37
response_date
0 2016-05-19T07:19:32.620Z
1 2016-05-26T07:14:24.7130Z
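If the lists in q6 are not in a consistent order (as in the question's edit), building dummies from positional columns can produce repeated column labels. One order-independent alternative, sketched here under the assumption that q6 holds lists of numeric strings, is to join each list and use Series.str.get_dummies. The input frame and the names wide and result are made up for illustration:

```python
import pandas as pd

# Hypothetical input: q6 holds lists of numeric strings, possibly unordered.
df = pd.DataFrame({
    "participant_id": [37, 37, 37],
    "q6": [["1", "2"], ["1", "2", "4"], ["2", "1"]],
})

# Join each list into a string such as "1|2", then build indicator columns.
wide = df["q6"].str.join("|").str.get_dummies(sep="|")

# Make sure all four positions exist, then rename 1..4 to q6..q9.
wide = wide.reindex(columns=["1", "2", "3", "4"], fill_value=0)
wide.columns = ["q6", "q7", "q8", "q9"]

# Replace the nested column with the four indicator columns.
result = pd.concat([df.drop(columns="q6"), wide], axis=1)
print(result)
```

Because the indicators are keyed on the values themselves rather than on their position in the list, ["2", "1"] and ["1", "2"] give the same row.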
Upvotes: 1
Reputation: 720
import pandas as pd
import numpy as np
df = pd.DataFrame({"q6":[[1,2],[1,2,4]]})
a = df.q6.values.tolist()
e = [np.array(x)-1 for x in a]
b = np.zeros((len(e),4))
for i in range(0, len(e)):
b[i][e[i]] = 1
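# Build the frame directly from the indicator matrix, one column per position
# (passing the 2-D pieces from np.hsplit as dict values is rejected by pandas).
df1 = pd.DataFrame(b, columns=['q6', 'q7', 'q8', 'q9'])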
At the end, I created another data frame with the 4 desired columns.
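To attach those columns back to the original rows, something along these lines might work. It is only a sketch: the participant_id column and the names out and positions are assumptions added for illustration, and the indicator frame reuses the original index so that the join aligns row by row:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one extra column next to the nested q6.
df = pd.DataFrame({"participant_id": [37, 37], "q6": [[1, 2], [1, 2, 4]]})

# Indicator matrix: one row per response, one column per possible value.
b = np.zeros((len(df), 4))
for row, positions in enumerate(df["q6"]):
    # Values are 1-based, so shift to 0-based column indices.
    b[row, np.asarray(positions, dtype=int) - 1] = 1

# Reuse the original index so the join below matches rows correctly.
df1 = pd.DataFrame(b, columns=["q6", "q7", "q8", "q9"], index=df.index)
out = df.drop(columns="q6").join(df1)
print(out)
```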
Upvotes: 1