BENAMARA ILAN

Reputation: 49

Transpose a Python DataFrame according to many variables

I have a DataFrame from one file with products and prices together. I have one file per day that records all the products bought by customers, so the number of columns depends on the largest number of products bought by one client. At the start, I have a file like this:

date     conv   product      Prices 
01/2016 'part ' A|B|C|E|F 15|20|30|40|50 
01/2016 'Pro'   D|B       10|10 

Then I split that file on "|", and after that I have 5 product columns in my new df, because the largest number of products bought by one client that day is 5.

The final DataFrame looks like this:

Date     Conv    Product_1 product_2 ... product_n  price_1  price_2 ... price_n
01/2016  'Part'        A        B           C            15      20          30 
01/2016  'Pro'         B        D           C            10      10          20 
02/2016  'Part'        E        A           B            25      5           10 

I would like to transpose the variables "product_1 ... product_n" and "price_1 ... price_n" and obtain a new df:

Date      Conv   Product   price 
01/2016  'Part'     A        15
01/2016  'Part'     B        20
01/2016  'Part'     C        30
01/2016  'Pro'      B        10
01/2016  'Pro'      D        10
01/2016  'Pro'      C        20
02/2016  'Part'     E        25
02/2016  'Part'     A         5
02/2016  'Part'     B        10

The difficulty is in transposing those variables while copying the variables Date and Conv.

I think that with SAS we can obtain that with this code:

proc transpose data=DF;
    var product_1-product_4 price_1-price_4;
    by Date Conv;
    copy Date Conv;
run;

But I can't find an equivalent in Python.

Does someone know how I can do this, please?

I tried with df.transpose, but it's not the result I want.

Upvotes: 1

Views: 1956

Answers (1)

jezrael

Reputation: 862581

You can first select the columns with list comprehensions using startswith, and then use pd.lreshape:

prods = ([col for col in df.columns if col.startswith('product_')])
prices = ([col for col in df.columns if col.startswith('price_')])

print (prods)
['product_1', 'product_2', 'product_n']
print (prices)
['price_1', 'price_2', 'price_n']

df1 = pd.lreshape(df, {'product' : prods, 'price' : prices}) 
print (df1)
     Conv     Date  price product
0  'Part'  01/2016     15       A
1   'Pro'  01/2016     10       B
2  'Part'  02/2016     25       E
3  'Part'  01/2016     20       B
4   'Pro'  01/2016     10       D
5  'Part'  02/2016      5       A
6  'Part'  01/2016     30       C
7   'Pro'  01/2016     20       C
8  'Part'  02/2016     10       B
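
If you also need the rows grouped by Date and Conv as in your desired output, you can simply sort afterwards, for example:

#optional: group the reshaped rows back together by date and customer type
df1 = df1.sort_values(['Date', 'Conv']).reset_index(drop=True)
print (df1)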

EDIT for the more specific question:

#new df1 from column product
df1 = (df['product'].str.split('|', expand=True))
#add prod_ to column names
prods = df1.columns = ['prod_' + str(col) for col in df1.columns] 

#new df2 from column Prices
df2 = (df['Prices'].str.split('|', expand=True))
#add part_ to column names
prices = df2.columns = ['part_' + str(col) for col in df2.columns]

#join all together
df3 = (pd.concat([df[['date','conv']], df1, df2], axis=1))

#reshape
print (pd.lreshape(df3, {'product' : prods, 'price' : prices}))
     conv     date price product
0  'part'  01/2016    15       A
1   'pro'  01/2016    10       D
2  'part'  01/2016    20       B
3   'pro'  01/2016    10       B
4  'part'  01/2016    30       C
5  'part'  01/2016    40       E
6  'part'  01/2016    50       F

Another solution with join:

#create dataframe and stack, drop level of multiindex
s1 = (df['product'].str.split('|', expand=True)).stack()
s1.index = s1.index.droplevel(-1)
s1.name = 'product'

s2 = (df['Prices'].str.split('|', expand=True)).stack()
s2.index = s2.index.droplevel(-1)
s2.name = 'price'

#remove original columns    
df = df.drop(['product','Prices'], axis=1)

#join series to dataframe    
df1 = (df.join(s1).reset_index(drop=True))
df2 = (df.join(s2).reset_index(drop=True))

#join all together
print (pd.concat([df1, df2[['price']]], axis=1))
      date    conv product price
0  01/2016  'part'       A    15
1  01/2016  'part'       B    20
2  01/2016  'part'       C    30
3  01/2016  'part'       E    40
4  01/2016  'part'       F    50
5  01/2016   'pro'       D    10
6  01/2016   'pro'       B    10

Timings:

In [598]: %timeit (a(df))
100 loops, best of 3: 10.6 ms per loop

In [599]: %timeit (b(df_a))
100 loops, best of 3: 14.1 ms per loop

Code for timings:

import pandas as pd

df = pd.DataFrame({'date': {0: '01/2016', 1: '01/2016'}, 
                   'conv': {0: "'part'", 1: "'pro'"}, 
                   'Prices': {0: '15|20|30|40|50', 1: '10|10'}, 
                   'product': {0: 'A|B|C|E|F', 1: 'D|B'}}, 
                    columns =['date','conv','product','Prices'])
df = pd.concat([df]*1000).reset_index(drop=True)

print (df)
df_a = df.copy()

def a(df):
    df1 = (df['product'].str.split('|', expand=True))
    prods = df1.columns = ['prod_' + str(col) for col in df1.columns] 

    df2 = (df['Prices'].str.split('|', expand=True))
    prices = df2.columns = ['part_' + str(col) for col in df2.columns]

    df3 = (pd.concat([df[['date','conv']], df1, df2], axis=1))

    return (pd.lreshape(df3, {'product' : prods, 'price' : prices}))


def b(df):
    s1 = (df['product'].str.split('|', expand=True)).stack()
    s1.index = s1.index.droplevel(-1)
    s1.name = 'product'

    s2 = (df['Prices'].str.split('|', expand=True)).stack()
    s2.index = s2.index.droplevel(-1)
    s2.name = 'price'

    df = df.drop(['product','Prices'], axis=1)

    df1 = (df.join(s1).reset_index(drop=True))
    df2 = (df.join(s2).reset_index(drop=True))

    return (pd.concat([df1, df2[['price']]], axis=1))

print (a(df))    
print (b(df_a))  

EDIT:

lreshape is now undocumented, and it is possible it will be removed in the future (together with pd.wide_to_long).

A possible solution is merging all three functions into one, maybe melt, but it is not implemented yet. Maybe in some new version of pandas; then my answer will be updated.
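
Until then, one possible lreshape-free sketch (assuming the df3, prods and prices variables from the EDIT above) is to melt each group of columns separately and pair the results back up side by side:

#melt products and prices separately with the same id_vars and the same column order,
#so the rows of the two long frames line up, then drop the padding rows created by the split
prods_long = pd.melt(df3, id_vars=['date','conv'], value_vars=prods, value_name='product')
prices_long = pd.melt(df3, id_vars=['date','conv'], value_vars=prices, value_name='price')

out = (pd.concat([prods_long.drop('variable', axis=1), prices_long[['price']]], axis=1)
         .dropna(subset=['product'])
         .reset_index(drop=True))
print (out)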

Upvotes: 3
