Reputation: 13969
I have a problem of creating and rearrange a dataset. I Looked at the pandas groupby function and think it might help me do it but I am to inexperienced to make it happen. I have created an example of my problem below: my df:
vehicle color a b c d A1 A2 A3 B1 B2 B3 C1 C2 C3 D1 D2 D3
resp
1 bike green 5 4 1 3 3 4 5 3 5 3 NaN NaN NaN NaN NaN NaN
2 walk red 5 3 3 3 4 5 3 3 5 4 NaN NaN NaN NaN NaN NaN
3 car green 4 2 3 3 4 3 5 4 5 5 NaN NaN NaN NaN NaN NaN
4 car blue 4 5 4 4 NaN NaN NaN NaN NaN NaN 5 5 5 3 3 4
5 bus black 2 4 4 3 NaN NaN NaN 2 3 3 2 2 1 NaN NaN NaN
6 car red 4 2 3 3 3 4 4 NaN NaN NaN 4 4 4 NaN NaN NaN
7 bus blue 5 5 2 3 3 3 5 4 3 2 NaN NaN NaN NaN NaN NaN
8 walk red 3 3 4 3 NaN NaN NaN 5 5 5 5 3 3 NaN NaN NaN
9 car blue 5 3 4 3 3 3 3 NaN NaN NaN 4 3 4 NaN NaN NaN
The dataset contains respondents and answers to a questionare. What I would like to do is to make a new dataframe with resp as index and the data from how the respondents answered rearranged. The data in columns a,b,c,d, vehicle and color are stacked for the respondents (Hope thats the right way to express it) in the new dataframe. Also the values from columns A to C are in the new frame under columns BL_val. Only the data that corresponds from Capital letter (A1-D3) to small letter (a,b,c,d) are filled in. The rest are NAN.
I would like to create a new dataframe from this and it shoul look like:
ds:
vehicle color sl sl_val BL_val1 BL_val2 BL_val3
resp
1 bike green a 5 3 4 5
1 bike green b 4 3 5 3
1 bike green c 1 NaN NaN NaN
1 bike green d 3 NaN NaN NaN
2 walk red a 5 4 5 3
2 walk red b 3 3 5 4
2 walk red c 3 NaN NaN NaN
2 walk red d 3 NaN NaN NaN
3 car green a 4 4 3 5
3 car green b 2 4 5 5
3 car green c 3 NaN NaN NaN
3 car green d 3 NaN NaN NaN
4 car blue a 4 NaN NaN NaN
4 car blue b 5 NaN NaN NaN
4 car blue c 4 5 5 5
4 car blue d 4 3 3 4
5 bus black a 2 NaN NaN NaN
5 bus black b 4 2 3 3
5 bus black c 4 2 2 1
5 bus black d 3 NaN NaN NaN
6 car red a 4 3 4 4
6 car red b 2 NaN NaN NaN
6 car red c 3 4 4 4
6 car red d 3 NaN NaN NaN
7 bus blue a 5 3 3 5
7 bus blue b 5 4 3 2
7 bus blue c 2 NaN NaN NaN
7 bus blue d 3 NaN NaN NaN
8 walk red a 3 NaN NaN NaN
8 walk red b 3 5 5 5
8 walk red c 4 5 3 3
8 walk red d 3 NaN NaN NaN
9 car blue a 5 3 3 3
9 car blue b 3 NaN NaN NaN
9 car blue c 4 4 3 4
9 car blue d NaN NaN NaN NaN
I really need some help with this, I cant figure it out!!
Upvotes: 6
Views: 10563
Reputation: 375535
A naive way would write a helper function to extract the relevant subDataFrames:
In [11]: def get_letter(df, letter):
res = df.loc[:, ['vehicle', 'color', letter] + [letter.upper() + str(i) for i in xrange(1, 4)]]
res.columns = ['vehicle', 'color', 'sl_val', 'BL_val1', 'BL_val2', 'BL_val3']
res['sl'] = letter
return res
In [12]: get_letter(df, 'a')
Out[12]:
vehicle color sl_val BL_val1 BL_val2 BL_val3 sl
resp
1 bike green 5 3 4 5 a
2 walk red 5 4 5 3 a
3 car green 4 4 3 5 a
4 car blue 4 NaN NaN NaN a
5 bus black 2 NaN NaN NaN a
6 car red 4 3 4 4 a
7 bus blue 5 3 3 5 a
8 walk red 3 NaN NaN NaN a
9 car blue 5 3 3 3 a
Then concating these gets the result you're after (possibly with a sort):
In [13]: pd.concat(get_letter(df, letter) for letter in 'abcd')
In [14]: pd.concat(get_letter(df, letter) for letter in 'abcd').sort()
Upvotes: 3
Reputation: 353119
There might be a more pandorable way to do this, but I find that the pattern of using groupby
to make the groups, performing explicit operations on them, and then recombining, is often a simple way to get what I want. Sure, I could spend half an hour coming up with something more elegant, but then I wouldn't have any time to hang out on SO..
Anyway, how about something like this?
df = df.set_index(["resp", "vehicle", "color"])
grouped = df.groupby(lambda x: x[0].lower(), axis=1)
new_grouped = []
for key, group in grouped:
group.columns = ["sl_val"] + ["BL_val{}".format(i) for i in range(1,4)]
group["sl"] = key
new_grouped.append(group)
df2 = pd.concat(new_grouped).reset_index()
df2 = df2.sort(["resp", "vehicle", "color"]).set_index("resp")
df2 = df2[["vehicle", "color", "sl"] + [k for k in df2.columns if "_" in k]]
Starting from
>>> df = df.set_index(["resp", "vehicle", "color"])
>>> df
a b c d A1 A2 A3 B1 B2 B3 C1 C2 C3 D1 D2 D3
resp vehicle color
1 bike green 5 4 1 3 3 4 5 3 5 3 NaN NaN NaN NaN NaN NaN
2 walk red 5 3 3 3 4 5 3 3 5 4 NaN NaN NaN NaN NaN NaN
3 car green 4 2 3 3 4 3 5 4 5 5 NaN NaN NaN NaN NaN NaN
4 car blue 4 5 4 4 NaN NaN NaN NaN NaN NaN 5 5 5 3 3 4
5 bus black 2 4 4 3 NaN NaN NaN 2 3 3 2 2 1 NaN NaN NaN
6 car red 4 2 3 3 3 4 4 NaN NaN NaN 4 4 4 NaN NaN NaN
7 bus blue 5 5 2 3 3 3 5 4 3 2 NaN NaN NaN NaN NaN NaN
8 walk red 3 3 4 3 NaN NaN NaN 5 5 5 5 3 3 NaN NaN NaN
9 car blue 5 3 4 3 3 3 3 NaN NaN NaN 4 3 4 NaN NaN NaN
We can group by the starting letter in lowercase:
>>> grouped = df.groupby(lambda x: x[0].lower(), axis=1)
Producing a bunch of groups which look like:
>>> next(iter(grouped))
('a', a A1 A2 A3
resp vehicle color
1 bike green 5 3 4 5
2 walk red 5 4 5 3
3 car green 4 4 3 5
4 car blue 4 NaN NaN NaN
5 bus black 2 NaN NaN NaN
6 car red 4 3 4 4
7 bus blue 5 3 3 5
8 walk red 3 NaN NaN NaN
9 car blue 5 3 3 3)
Then we simply change the names, add the "sl"
column, and recombine them using pd.concat
. The last bits simply match your desired order.
Final result:
>>> df2
vehicle color sl sl_val BL_val1 BL_val2 BL_val3
resp
1 bike green a 5 3 4 5
1 bike green b 4 3 5 3
1 bike green c 1 NaN NaN NaN
1 bike green d 3 NaN NaN NaN
2 walk red a 5 4 5 3
2 walk red b 3 3 5 4
2 walk red c 3 NaN NaN NaN
2 walk red d 3 NaN NaN NaN
3 car green a 4 4 3 5
3 car green b 2 4 5 5
3 car green c 3 NaN NaN NaN
3 car green d 3 NaN NaN NaN
4 car blue a 4 NaN NaN NaN
4 car blue b 5 NaN NaN NaN
4 car blue c 4 5 5 5
4 car blue d 4 3 3 4
5 bus black a 2 NaN NaN NaN
5 bus black b 4 2 3 3
5 bus black c 4 2 2 1
5 bus black d 3 NaN NaN NaN
6 car red a 4 3 4 4
6 car red b 2 NaN NaN NaN
6 car red c 3 4 4 4
6 car red d 3 NaN NaN NaN
7 bus blue a 5 3 3 5
7 bus blue b 5 4 3 2
7 bus blue c 2 NaN NaN NaN
7 bus blue d 3 NaN NaN NaN
8 walk red a 3 NaN NaN NaN
8 walk red b 3 5 5 5
8 walk red c 4 5 3 3
8 walk red d 3 NaN NaN NaN
9 car blue a 5 3 3 3
9 car blue b 3 NaN NaN NaN
9 car blue c 4 4 3 4
9 car blue d 3 NaN NaN NaN
Upvotes: 5