Kees
Kees

Reputation: 471

Splitting variables in pandas dataframe into new rows with repeat key

I think my code is close, but really screwed up. Basically, I have rows of sales data. One of the variables is a receipt number, while another is a string containing Units and Products as a string.

I'd like a new row for each unique product, retaining the relevant receipt number for that row. For instance, an entry for Receipt A with products L and Q should become two separate entries: Receipt A with L, and Receipt A with Q. So I'm trying to split the string into variables and also create a new entry under the same receipt identifier.

Anyway, but the time I'm done I've accidentally duplicated some of the entries. Any help with a) making it generate the right output and b) prettier/simpler code? (I also don't need the old variable Description).

Many thanks

import pandas as pd
import numpy as np
df = pd.DataFrame({"Date": ["9/26/17 2:33 PM", "9/26/17 2:23 PM", "9/26/17 2:22 PM"], "Receipt number": ["1-1002","1-1001","1-1000"], "Description": ["1 x Capacino, 2 x Americano","1 x Americano","1 x Latte"]})
df

df2 = df['Description'].str.split(',').apply(pd.Series, 1).stack()
df2.index = df2.index.droplevel(-1)
df2.name = 'Product' 
df = df.join(df2)
df.join(df['Product'].str.split(' x ', 1, expand=True).rename(columns={0:'Units',   1:'Product Name'}))

Upvotes: 0

Views: 83

Answers (1)

Scratch'N'Purr
Scratch'N'Purr

Reputation: 10429

Well, you are close. Except for the last line. Here's what I did, following along with your code:

import pandas as pd
import numpy as np
df = pd.DataFrame({"Date": ["9/26/17 2:33 PM", "9/26/17 2:23 PM", "9/26/17 2:22 PM"], "Receipt number": ["1-1002","1-1001","1-1000"], "Description": ["1 x Capacino, 2 x Americano","1 x Americano","1 x Latte"]})

df2 = df['Description'].str.split(',').apply(pd.Series, 1).stack()
df2.index = df2.index.droplevel(-1)
df2.name = 'Product' 
df = df.join(df2)

# Here is where I diverge
df['Units'] = df['Product'].apply(lambda x: int(x.split(' x ')[0]))
df['Product'] = df['Product'].apply(lambda x: x.split(' x ')[-1])
df = df.drop('Description', axis=1)

Result

              Date Receipt number    Product  Units
0  9/26/17 2:33 PM         1-1002   Capacino      1
0  9/26/17 2:33 PM         1-1002  Americano      2
1  9/26/17 2:23 PM         1-1001  Americano      1
2  9/26/17 2:22 PM         1-1000      Latte      1

Upvotes: 1

Related Questions