royalewithcheese

Reputation: 502

How to split the header row to make new columns?

Suppose I have a df like this:

uid | 202012AP   | 202012CAN  | 202012CASA | 202012total  | 202101AP   | 202101CAN  | 202101CASA | 202101total  |...
A1  |   0        |   1        |   10       |    11        |     1      |   2        |   0        |   3          |...
A2  |   1        |   8        |   99       |    108       |     4      |   5        |   1        |   10         |...
B1  |   0        |   1        |   99       |    100       |     2      |   3        |   2        |   7          |...

I want to be able to generate an output df like the following:

uid | no     | reg  | amount
A1  | 202012 | AP   |   0
A1  | 202012 | CAN  |   1
A1  | 202012 | CASA |   10
A1  | 202101 | AP   |   1
A1  | 202101 | CAN  |   2
A1  | 202101 | CASA |   0
A2  | 202012 | AP   |   1
A2  | 202012 | CAN  |   8
A2  | 202012 | CASA |   99
A2  | 202101 | AP   |   4
A2  | 202101 | CAN  |   5
A2  | 202101 | CASA |   1

and so on..

Is there a pandas library to achieve this task? Any help would be really appreciated.

P.S. I actually don't need the total columns (202012total, 202101total) in the output df, but I included them since that is the df I have.
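For reference, the sample above can be rebuilt as a reproducible snippet (values copied from the table):

import pandas as pd

df = pd.DataFrame({
    "uid": ["A1", "A2", "B1"],
    "202012AP": [0, 1, 0],
    "202012CAN": [1, 8, 1],
    "202012CASA": [10, 99, 99],
    "202012total": [11, 108, 100],
    "202101AP": [1, 4, 2],
    "202101CAN": [2, 5, 3],
    "202101CASA": [0, 1, 2],
    "202101total": [3, 10, 7],
})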

Upvotes: 0

Views: 90

Answers (3)

sammywemmy

Reputation: 28729

Keep only the relevant columns (this part of the answer uses the question's earlier column names, 01reg1 ... 02reg3; the UPDATE below covers the 202012AP-style names):

box = df.filter(regex="uid|0[1-2]reg.")

Create a dataframe by splitting with a regular expression

cols = (
    pd.Series(box.columns)
    .str.split("(reg.)", expand=True)
    .iloc[:, :2]
    .fillna("") # this allows easy melting on "uid"
    .set_axis(["no", "reg"], axis="columns")
)
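For reference, cols itself looks roughly like this; note the empty reg for uid, which is what makes the later melt on "uid" straightforward:

print(cols)

    no   reg
0  uid
1   01  reg1
2   01  reg2
3   01  reg3
4   02  reg1
5   02  reg2
6   02  reg3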

Assign cols as a MultiIndex to box.columns:

box.columns = pd.MultiIndex.from_frame(cols)
box


no   uid    01                02
reg       reg1  reg2  reg3  reg1  reg2  reg3
0     A1     0     1    10     1     2     0
1     A2     1     8    99     4     5     1
2     B1     0     1    99     2     3     2

Now you can melt:

box.melt("uid")

   uid  no   reg  value
0   A1  01  reg1    0
1   A2  01  reg1    1
2   B1  01  reg1    0
3   A1  01  reg2    1
4   A2  01  reg2    8
5   B1  01  reg2    1
6   A1  01  reg3    10
7   A2  01  reg3    99
8   B1  01  reg3    99
9   A1  02  reg1    1
10  A2  02  reg1    4
11  B1  02  reg1    2
12  A1  02  reg2    2
13  A2  02  reg2    5
14  B1  02  reg2    3
15  A1  02  reg3    0
16  A2  02  reg3    1
17  B1  02  reg3    2
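To get the amount column name from the question in one go, pass value_name to melt (same box frame as above):

box.melt("uid", value_name="amount")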

UPDATE (06/01/2021): A simpler (in my opinion) abstraction is possible with the pivot_longer function from pyjanitor; at the moment you have to install the latest development version from GitHub:

# install latest dev version
# pip install git+https://github.com/ericmjl/pyjanitor.git
import janitor

df.filter(regex="^(?!.*total)").pivot_longer(
    index="uid",
    names_pattern=r"(\d+)(.+)",
    names_to=("no", "reg"),
    values_to="amount",
    sort_by_appearance=True,
)

    uid no     reg  amount
0   A1  202012  AP      0
1   A1  202012  CAN     1
2   A1  202012  CASA    10
3   A1  202101  AP      1
4   A1  202101  CAN     2
5   A1  202101  CASA    0
6   A2  202012  AP      1
7   A2  202012  CAN     8
8   A2  202012  CASA    99
9   A2  202101  AP      4
10  A2  202101  CAN     5
11  A2  202101  CASA    1
12  B1  202012  AP      0
13  B1  202012  CAN     1
14  B1  202012  CASA    99
15  B1  202101  AP      2
16  B1  202101  CAN     3
17  B1  202101  CASA    2

Upvotes: 3

jezrael

Reputation: 863751

The idea is to first convert uid to the index with DataFrame.set_index, then build a MultiIndex in the columns with Index.str.extract, and reshape with DataFrame.stack:

df1 = df.set_index('uid')
# drop the total columns
df1 = df1.loc[:, ~df1.columns.str.endswith('total')]

# split each column name into (no, reg) and use the pairs as a MultiIndex
df1.columns = df1.columns.str.extract(r'(\d+)(\w+)').set_index([0, 1]).index
df1 = df1.rename_axis(['no', 'reg'], axis=1).stack([0, 1]).reset_index(name='amount')
print(df1)
   uid  no   reg  amount
0   A1  01  reg1       0
1   A1  01  reg2       1
2   A1  01  reg3      10
3   A1  02  reg1       1
4   A1  02  reg2       2
5   A1  02  reg3       0
6   A2  01  reg1       1
7   A2  01  reg2       8
8   A2  01  reg3      99
9   A2  02  reg1       4
10  A2  02  reg2       5
11  A2  02  reg3       1
12  B1  01  reg1       0
13  B1  01  reg2       1
14  B1  01  reg3      99
15  B1  02  reg1       2
16  B1  02  reg2       3
17  B1  02  reg3       2
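For illustration, the extracted pieces can be inspected on their own, before the columns are reassigned; str.extract returns one row per original column, with the leading digits and the trailing letters in separate columns:

df1 = df.set_index('uid')
df1 = df1.loc[:, ~df1.columns.str.endswith('total')]
# column 0 holds the "no" part, column 1 the "reg" part
print(df1.columns.str.extract(r'(\d+)(\w+)'))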

EDIT: the same code applied to the question's updated column names (202012AP, 202012CAN, ...):

df1 = df.set_index('uid')
df1 = df1.loc[:, ~df1.columns.str.endswith('total')]

df1.columns = df1.columns.str.extract(r'(\d+)(\w+)').set_index([0, 1]).index
df1 = df1.rename_axis(['no', 'reg'], axis=1).stack([0, 1]).reset_index(name='amount')
print(df1)
   uid      no   reg  amount
0   A1  202012    AP       0
1   A1  202012   CAN       1
2   A1  202012  CASA      10
3   A1  202101    AP       1
4   A1  202101   CAN       2
5   A1  202101  CASA       0
6   A2  202012    AP       1
7   A2  202012   CAN       8
8   A2  202012  CASA      99
9   A2  202101    AP       4
10  A2  202101   CAN       5
11  A2  202101  CASA       1
12  B1  202012    AP       0
13  B1  202012   CAN       1
14  B1  202012  CASA      99
15  B1  202101    AP       2
16  B1  202101   CAN       3
17  B1  202101  CASA       2

Upvotes: 3

Dani Mesejo

Reputation: 61930

You could use a combination of melt, concat, and sort_values:

# create long df using melt, cols is just an intermediary column
res = df.melt(id_vars='uid', value_name='amount', var_name='cols')

# remove total rows
res = res[~res['cols'].str.endswith('total')]

# concat the two new columns no and reg with the previously melted df
res = pd.concat((res.drop(columns='cols'), res['cols'].str.extract(r'(?P<no>\d+)(?P<reg>\w+)')), axis=1)

# sort both columns and rows to match the expected output
res = res.sort_values(['uid', 'no', 'reg'])[['uid', 'no', 'reg', 'amount']]

print(res)

Output

   uid  no   reg  amount
0   A1  01  reg1       0
3   A1  01  reg2       1
6   A1  01  reg3      10
12  A1  02  reg1       1
15  A1  02  reg2       2
18  A1  02  reg3       0
1   A2  01  reg1       1
4   A2  01  reg2       8
7   A2  01  reg3      99
13  A2  02  reg1       4
16  A2  02  reg2       5
19  A2  02  reg3       1
2   B1  01  reg1       0
5   B1  01  reg2       1
8   B1  01  reg3      99
14  B1  02  reg1       2
17  B1  02  reg2       3
20  B1  02  reg3       2
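The index shown above is just the leftover index from melt; assuming a clean 0..n index is preferred, it can be reset:

res = res.reset_index(drop=True)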

Notice that we also use str.extract as an intermediary step:

res['cols'].str.extract(r'(?P<no>\d+)(?P<reg>\w+)')

This uses a regex with named groups to create the named columns directly.
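As a small standalone illustration (using two of the question's column names as sample values):

pd.Series(['202012AP', '202101CASA']).str.extract(r'(?P<no>\d+)(?P<reg>\w+)')

       no   reg
0  202012    AP
1  202101  CASA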

Upvotes: 4
