ZoeAlleyneWashburne
ZoeAlleyneWashburne

Reputation: 45

Replicating %transpose SAS macro in Python

My company is moving away from SAS in favor of other tools, Python being the preferred one. I have a lot of legacy code that runs in SAS that will need to be ported over and some things are easier to do than others.

There are many datasets I work with that need be be pivoted such that there is one line per unique id column(s) and for that I've made extensive use of the %transpose macro found here: https://raw.githubusercontent.com/art297/transpose/master/transpose.sas. Essentially, it allows me to make a single function call where I pass the input dataset name, output dataset name, id column(s), and the columns to be pivoted to get my desired output.

I've been working to replicate the functionality using Python and I've gotten very close, but I'm stuck at the moment.

Here's an example :

import pandas as pd

cust = ['CUST123', 'CUST123', 'CUST123', 'CUST456', 'CUST456']
start_date = ['01/01/2021','01/02/2021','01/03/2021','01/04/2021','01/05/2021']
end_date = ['01/11/2021','01/12/2021','01/13/2021','01/14/2021','01/15/2021']
code = ['ABC123','DEF456','GHI789','JKL123','MNO456']
df = pd.DataFrame(list(zip(cust, start_date, end_date, code )), columns = ['cust', 'start_date', 'end_date', 'code'])

df['start_date'] = pd.to_datetime(df['start_date'])
df['end_date'] = pd.to_datetime(df['end_date'])

This gives me a sample dataset that looks like this :

      cust start_date   end_date    code
0  CUST123 2021-01-01 2021-01-11  ABC123
1  CUST123 2021-01-02 2021-01-12  DEF456
2  CUST123 2021-01-03 2021-01-13  GHI789
3  CUST456 2021-01-04 2021-01-14  JKL123
4  CUST456 2021-01-05 2021-01-15  MNO456

In SAS I would use :

%transpose(data=INPUT, 
   out=OUTPUT, 
   by=cust, 
   delimiter=_,
   var=start_date end_date code);

to get this as my desired output :

     cust start_date_1 end_date_1 code_1 start_date_2 end_date_2 code_2 start_date_3 end_date_3 code_3
0 CUST123   2021-01-01 2021-01-11 ABC123   2021-01-02 2021-01-12 DEF456   2021-01-03 2021-01-13 GHI789
1 CUST456   2021-01-04 2021-01-14 JKL123   2021-01-05 2021-01-15 MNO456

So far I've used this code :

df['idx'] = df.groupby('cust').cumcount()+1
df = df.pivot_table(index='cust', columns='idx', values=['start_date', 'end_date', 'code'], aggfunc='first')
df = df.sort_index(axis=1, level=1)
df.columns = [f'{x}_{y}' for x, y in df.columns]
df.reset_index()

However that yields :

      cust  code_1 end_date_1 start_date_1  code_2 end_date_2 start_date_2  code_3 end_date_3 start_date_3
0  CUST123  ABC123 2021-01-11   2021-01-01  DEF456 2021-01-12   2021-01-02  GHI789 2021-01-13   2021-01-03
1  CUST456  JKL123 2021-01-14   2021-01-04  MNO456 2021-01-15   2021-01-05

This is so very close to what I'm after. The only issue that the SAS macro keeps the pivoted columns in the order I specified in the function call. The Python code puts them in alphabetical order regardless of what I specify because that's just how it functions.

I could insert a for loop to prepend my pivot fields with [a...b...c...etc...] before sorting them and pivoting, then loop through the column headers to change them to include everything but the first character, but I can't imagine that is the best way to do it. It's ugly and I'm almost certain there's some other method I've not found in all my searching that's preferable.

Am I stuck using the aforementioned ugly method or am I correct that there's a better way?

Upvotes: 3

Views: 142

Answers (1)

sammywemmy
sammywemmy

Reputation: 28644

pass the sort_remaining=False in the sort_index function, to keep your columns as is.

result = (
    df.assign(idx=df.groupby("cust").cumcount() + 1)
    .pivot("cust", "idx")
    .sort_index(axis="columns", level=1, sort_remaining=False)
)

result.columns = result.columns.map(lambda x: f"{x[0]}_{x[1]}")

Upvotes: 1

Related Questions