Reputation: 502
Suppose I have a df like this:
uid | 202012AP | 202012CAN | 202012CASA | 202012total | 202101AP | 202101CAN | 202101CASA | 202101total |...
A1 | 0 | 1 | 10 | 11 | 1 | 2 | 0 | 3 |...
A2 | 1 | 8 | 99 | 108 | 4 | 5 | 1 | 10 |...
B1 | 0 | 1 | 99 | 100 | 2 | 3 | 2 | 7 |...
I want to generate an output df like the following:
uid | no | reg | amount
A1 | 202012 | AP | 0
A1 | 202012 | CAN | 1
A1 | 202012 | CASA | 10
A1 | 202101 | AP | 1
A1 | 202101 | CAN | 2
A1 | 202101 | CASA | 0
A2 | 202012 | AP | 1
A2 | 202012 | CAN | 8
A2 | 202012 | CASA | 99
A2 | 202101 | AP | 4
A2 | 202101 | CAN | 5
A2 | 202101 | CASA | 1
and so on..
Is there a pandas function to achieve this? Any help would be really appreciated.
P.S. I don't actually need the 202012total or 202101total columns in the output df, but I included them since that is the df I have.
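For anyone who wants to reproduce this, the df above can be built with:
import pandas as pd

df = pd.DataFrame(
    {
        "uid": ["A1", "A2", "B1"],
        "202012AP": [0, 1, 0],
        "202012CAN": [1, 8, 1],
        "202012CASA": [10, 99, 99],
        "202012total": [11, 108, 100],
        "202101AP": [1, 4, 2],
        "202101CAN": [2, 5, 3],
        "202101CASA": [0, 1, 2],
        "202101total": [3, 10, 7],
    }
)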
Upvotes: 0
Views: 90
Reputation: 28729
Keep only the relevant columns (drop the ...total columns):
box = df.filter(regex="^(?!.*total)")
Create a dataframe from the column labels by splitting them with a regular expression:
cols = (
    pd.Series(box.columns)
    .str.split(r"([A-Z]+)", expand=True)  # split off the region part (AP, CAN, CASA)
    .iloc[:, :2]
    .fillna("")  # this allows easy melting on "uid"
    .set_axis(["no", "reg"], axis="columns")
)
Assign cols as a MultiIndex to box.columns:
box.columns = pd.MultiIndex.from_frame(cols)
box
no   uid  202012             202101
reg           AP  CAN  CASA      AP  CAN  CASA
0     A1       0    1    10       1    2     0
1     A2       1    8    99       4    5     1
2     B1       0    1    99       2    3     2
Now you can melt:
box.melt("uid")
    uid      no   reg  value
0    A1  202012    AP      0
1    A2  202012    AP      1
2    B1  202012    AP      0
3    A1  202012   CAN      1
4    A2  202012   CAN      8
5    B1  202012   CAN      1
6    A1  202012  CASA     10
7    A2  202012  CASA     99
8    B1  202012  CASA     99
9    A1  202101    AP      1
10   A2  202101    AP      4
11   B1  202101    AP      2
12   A1  202101   CAN      2
13   A2  202101   CAN      5
14   B1  202101   CAN      3
15   A1  202101  CASA      0
16   A2  202101  CASA      1
17   B1  202101  CASA      2
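If you also want the value column named amount and the rows grouped by uid as in the expected output, one possible follow-up (just a renaming and sorting step, assuming that ordering is what you are after) is:
out = (
    box.melt("uid", value_name="amount")
       .sort_values(["uid", "no", "reg"], ignore_index=True)
)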
UPDATE (06/01/2021): A simpler (in my opinion) abstraction is possible with the pivot_longer function from pyjanitor; at the moment you have to install the latest development version from GitHub:
# install latest dev version
# pip install git+https://github.com/ericmjl/pyjanitor.git
import janitor
df.filter(regex="^(?!.*total)").pivot_longer(
    index="uid",
    names_pattern=r"(\d+)(.+)",
    names_to=("no", "reg"),
    values_to="amount",
    sort_by_appearance=True,
)
    uid      no   reg  amount
0    A1  202012    AP       0
1    A1  202012   CAN       1
2    A1  202012  CASA      10
3    A1  202101    AP       1
4    A1  202101   CAN       2
5    A1  202101  CASA       0
6    A2  202012    AP       1
7    A2  202012   CAN       8
8    A2  202012  CASA      99
9    A2  202101    AP       4
10   A2  202101   CAN       5
11   A2  202101  CASA       1
12   B1  202012    AP       0
13   B1  202012   CAN       1
14   B1  202012  CASA      99
15   B1  202101    AP       2
16   B1  202101   CAN       3
17   B1  202101  CASA       2
Upvotes: 3
Reputation: 863751
The idea is to convert uid to the index with DataFrame.set_index first, then create a MultiIndex in the columns with Index.str.extract, and reshape with DataFrame.stack:
df1 = df.set_index('uid')
# drop the ...total columns
df1 = df1.loc[:, ~df1.columns.str.endswith('total')]
# split each label into a (no, reg) MultiIndex
df1.columns = df1.columns.str.extract(r'(\d+)(\w+)').set_index([0, 1]).index
# stack both column levels into rows
df1 = df1.rename_axis(['no', 'reg'], axis=1).stack([0, 1]).reset_index(name='amount')
print(df1)
    uid      no   reg  amount
0    A1  202012    AP       0
1    A1  202012   CAN       1
2    A1  202012  CASA      10
3    A1  202101    AP       1
4    A1  202101   CAN       2
5    A1  202101  CASA       0
6    A2  202012    AP       1
7    A2  202012   CAN       8
8    A2  202012  CASA      99
9    A2  202101    AP       4
10   A2  202101   CAN       5
11   A2  202101  CASA       1
12   B1  202012    AP       0
13   B1  202012   CAN       1
14   B1  202012  CASA      99
15   B1  202101    AP       2
16   B1  202101   CAN       3
17   B1  202101  CASA       2
Upvotes: 3
Reputation: 61930
You could use a combination of melt, concat, and sort_values:
# create a long df using melt; "cols" is just an intermediate column
res = df.melt(id_vars='uid', value_name='amount', var_name='cols')
# remove the total rows
res = res[~res['cols'].str.endswith('total')]
# concat the two new columns no and reg with the previously melted df
res = pd.concat((res.drop(columns='cols'), res['cols'].str.extract(r'(?P<no>\d+)(?P<reg>\w+)')), axis=1)
# sort both columns and rows to match the expected output
res = res.sort_values(['uid', 'no', 'reg'])[['uid', 'no', 'reg', 'amount']]
print(res)
Output
    uid      no   reg  amount
0    A1  202012    AP       0
3    A1  202012   CAN       1
6    A1  202012  CASA      10
12   A1  202101    AP       1
15   A1  202101   CAN       2
18   A1  202101  CASA       0
1    A2  202012    AP       1
4    A2  202012   CAN       8
7    A2  202012  CASA      99
13   A2  202101    AP       4
16   A2  202101   CAN       5
19   A2  202101  CASA       1
2    B1  202012    AP       0
5    B1  202012   CAN       1
8    B1  202012  CASA      99
14   B1  202101    AP       2
17   B1  202101   CAN       3
20   B1  202101  CASA       2
Notice that we also use str.extract as an intermediate step:
res['cols'].str.extract(r'(?P<no>\d+)(?P<reg>\w+)')
This uses a regex with named groups to create the no and reg columns directly.
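For illustration, this is roughly what that extraction produces on a couple of the melted labels (a standalone sketch, separate from the answer's code):
import pandas as pd

s = pd.Series(["202012AP", "202101CASA"])
print(s.str.extract(r"(?P<no>\d+)(?P<reg>\w+)"))
#        no   reg
# 0  202012    AP
# 1  202101  CASA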
Upvotes: 4