Reputation: 49
I have a Dataframe from one file, i have products and price together. I have on file per day who registered all the products bought by customer. So the length of columns depends of the biggest number of product bought by One client. On the start, i have a file like that :
date conv product Prices
01/2016 'part ' A|B|C|E|F 15|20|30|40|50
01/2016 'Pro' D|B 10|10
Then i split that file by "|", and after that i have 5 columns on my new df. beacause the biggest number of product bought by One client for that day is equal 5.
The Final DataFrame give :
Date Conv Product_1 product_2 ... product_n price_1 price_2 ... price_n
01/2016 'Part' A B C 15 20 30
01/2016 'Pro' B D C 10 10 20
02/2016 'Part' E A B 25 5 10
I would like to transpose Variables "Product_1 ...product_n" and "price 1 ... price_n". And obtain a new df :
Date Conv Product price
01/2016 'Part' A 15
01/2016 'Part' B 20
01/2016 'Part' C 30
01/2016 'Pro' B 10
01/2016 'Pro' D 10
01/2016 'Pro' C 20
02/2016 'Part' E 25
02/2016 'Part' A 5
02/2016 'Part' B 10
The difficulty is on transpose variable and copy the variable Date and conv.
I think with SAS
we can obtain that with the code :
Proc transpose ;
Data = DF;
VAR product_1-product_4 price1_price_4;
BY Date Conv;
COPY Date Conv;
But on Python I don't find an equivalent.
Someone Know how can I do please ?
I try with : df.transpose
But it's not the result that I want.
Upvotes: 1
Views: 1956
Reputation: 862581
You can first select columns by list
comprehension with startswith
and then use pd.lreshape
:
prods = ([col for col in df.columns if col.startswith('product_')])
prices = ([col for col in df.columns if col.startswith('price_')])
print (prods)
['product_1', 'product_2', 'product_n']
print (prices)
['price_1', 'price_2', 'price_n']
df1 = pd.lreshape(df, {'product' : prods, 'price' : prices})
print (df1)
Conv Date price product
0 'Part' 01/2016 15 A
1 'Pro' 01/2016 10 B
2 'Part' 02/2016 25 E
3 'Part' 01/2016 20 B
4 'Pro' 01/2016 10 D
5 'Part' 02/2016 5 A
6 'Part' 01/2016 30 C
7 'Pro' 01/2016 20 C
8 'Part' 02/2016 10 B
EDIT by more specify question:
#new df1 from column product
df1 = (df['product'].str.split('|', expand=True))
#add prod_ to column names
prods = df1.columns = ['prod_' + str(col) for col in df1.columns]
#new df2 from column Prices
df2 = (df['Prices'].str.split('|', expand=True))
#add part_ to column names
prices = df2.columns = ['part_' + str(col) for col in df2.columns]
#join all together
df3 = (pd.concat([df[['date','conv']], df1, df2], axis=1))
#reshape
print (pd.lreshape(df3, {'product' : prods, 'price' : prices}))
conv date price product
0 'part' 01/2016 15 A
1 'pro' 01/2016 10 D
2 'part' 01/2016 20 B
3 'pro' 01/2016 10 B
4 'part' 01/2016 30 C
5 'part' 01/2016 40 E
6 'part' 01/2016 50 F
Another solution with join
:
#create dataframe and stack, drop level of multiindex
s1 = (df['product'].str.split('|', expand=True)).stack()
s1.index = s1.index.droplevel(-1)
s1.name = 'product'
s2 = (df['Prices'].str.split('|', expand=True)).stack()
s2.index = s2.index.droplevel(-1)
s2.name = 'price'
#remove original columns
df = df.drop(['product','Prices'], axis=1)
#join series to dataframe
df1 = (df.join(s1).reset_index(drop=True))
df2 = (df.join(s2).reset_index(drop=True))
#join all togehter
print (pd.concat([df1, df2[['price']]], axis=1))
date conv product price
0 01/2016 'part' A 15
1 01/2016 'part' B 20
2 01/2016 'part' C 30
3 01/2016 'part' E 40
4 01/2016 'part' F 50
5 01/2016 'pro' D 10
6 01/2016 'pro' B 10
Timings:
In [598]: %timeit (a(df))
100 loops, best of 3: 10.6 ms per loop
In [599]: %timeit (b(df_a))
100 loops, best of 3: 14.1 ms per loop
Code for timings:
import pandas as pd
df = pd.DataFrame({'date': {0: '01/2016', 1: '01/2016'},
'conv': {0: "'part'", 1: "'pro'"},
'Prices': {0: '15|20|30|40|50', 1: '10|10'},
'product': {0: 'A|B|C|E|F', 1: 'D|B'}},
columns =['date','conv','product','Prices'])
df = pd.concat([df]*1000).reset_index(drop=True)
print (df)
df_a = df.copy()
def a(df):
df1 = (df['product'].str.split('|', expand=True))
prods = df1.columns = ['prod_' + str(col) for col in df1.columns]
df2 = (df['Prices'].str.split('|', expand=True))
prices = df2.columns = ['part_' + str(col) for col in df2.columns]
df3 = (pd.concat([df[['date','conv']], df1, df2], axis=1))
return (pd.lreshape(df3, {'product' : prods, 'price' : prices}))
def b(df):
s1 = (df['product'].str.split('|', expand=True)).stack()
s1.index = s1.index.droplevel(-1)
s1.name = 'product'
s2 = (df['Prices'].str.split('|', expand=True)).stack()
s2.index = s2.index.droplevel(-1)
s2.name = 'price'
df = df.drop(['product','Prices'], axis=1)
df1 = (df.join(s1).reset_index(drop=True))
df2 = (df.join(s2).reset_index(drop=True))
return (pd.concat([df1, df2[['price']]], axis=1))
print (a(df))
print (b(df_a))
EDIT:
lreshape
is now undocumented, but is possible in future will by removed (with pd.wide_to_long too).
Possible solution is merging all 3 functions to one - maybe melt
, but now it is not implementated. Maybe in some new version of pandas. Then my answer will be updated.
Upvotes: 3