Reputation: 83
I’m trying to sort some sales data by day of sale and product ID, and then compute some statistics with pandas. Is there an efficient way to do this? My dataset has millions of rows.
The dataset looks like this (df1, 3,000,000+ rows):
|productID |productCategory |expiryDate         |Price |Currency |quantitySold |daySold           |
|----------|----------------|-------------------|------|---------|-------------|------------------|
|Fdgd4     |Ergdgf          |15sep2020 00:00:00 |125   |USD      |5675         |18feb2017 12:45:17|
|Sd23454   |sdfdsr          |17mar2018 00:00:00 |39    |USD      |654          |31jan2017 12:45:17|
|Fdgd4     |Ergdgf          |15sep2020 00:00:00 |125   |USD      |300          |18feb2017 09:17:15|
|Sd23454   |sdfdsr          |17mar2018 00:00:00 |39    |USD      |200          |31jan2017 15:30:35|
|Rt4564    |fdgdf           |13jun2018 00:00:00 |45    |USD      |1544         |31feb2017 13:25:31|
|Fdgd4     |Ergdgf          |15sep2020 00:00:00 |125   |USD      |4487         |18mar2017 09:17:15|
|Sd23454   |sdfdsr          |17mar2018 00:00:00 |39    |USD      |7895         |31aug2017 15:30:35|
I would like to compute some simple statistics per productID, per day. So I think my code should first order the rows by day and then by product, then compute the statistics and add them to the table.
The result in this example is (df2):
|productID |productCategory |expiryDate         |Price |Currency |quantitySold |daySold            |volSTD |totalVol |totalRevenue|
|----------|----------------|-------------------|------|---------|-------------|-------------------|-------|---------|------------|
|Sd23454   |sdfdsr          |17mar2018 00:00:00 |39    |USD      |654          |31jan2017 12:45:17 |321.02 |854      |33306       |
|Fdgd4     |Ergdgf          |15sep2020 00:00:00 |125   |USD      |300          |31jan2017 15:30:35 |0      |300      |37500       |
|Fdgd4     |Ergdgf          |15sep2020 00:00:00 |125   |USD      |5675         |18feb2017 12:45:17 |840.04 |10162    |1270250     |
|Rt4564    |fdgdf           |13jun2018 00:00:00 |45    |USD      |1544         |31feb2017 13:25:31 |0      |1544     |69480       |
|Sd23454   |sdfdsr          |17mar2018 00:00:00 |39    |USD      |7895         |31aug2017 15:30:35 |0      |7895     |307905      |
I use a nested for loop in pandas, which gives the expected result, but it takes way too long (a couple of hours). I am looking for a fast way to get this result.
My code (probably one of the worst you’ve ever seen):
uniqueDays = df1.daySold.unique()
numberOfDays = uniqueDays.shape[0]
df_results = pd.DataFrame(columns=['productID', 'productCategory', 'expiryDate', 'Price', 'Currency', 'quantitySold', 'daySold', 'volSTD', 'totalVol', 'totalRevenue'])
for i in range(0, numberOfDays):
    temp1 = df1.loc[df1['daySold'] == uniqueDays[i]]
    uniqueID = temp1.productID.unique()
    numberOfUniqueID = uniqueID.shape[0]
    for j in range(0, numberOfUniqueID):
        temp2 = temp1.loc[temp1['productID'] == uniqueID[j]]
        volSTD = temp2.quantitySold.std()
        totalVol = temp2.quantitySold.sum()
        totalRevenue = temp2.quantitySold.dot(temp2.Price)
        temp3 = temp2.iloc[0]  # it does not matter which row I pick
        temp3['volSTD'] = volSTD
        temp3['totalVol'] = totalVol
        temp3['totalRevenue'] = totalRevenue
        df_results = df_results.append(temp3)
This gives me the results I want, but it is way too slow. In particular, adding the columns (volSTD, totalVol and totalRevenue) to temp3 and appending temp3 to df_results take a combined 81.3% of the processing time.
Does anyone have a faster way to do this? Using vectors? Or filling an existing dataframe instead of appending?
Thanks
Upvotes: 0
Views: 679
Reputation: 3689
How about groupby? It handles, so to speak, the iterations much more efficiently than loops, in much shorter and more readable code. You would group on daySold and productID. This is obviously mock data, but you would want to turn your daySold into a datetime object first so you can easily group on it. I just kept the day, but you could keep the time if needed:
df.daySold = pd.to_datetime(df.daySold.apply(lambda x: x[:9]), format="%d%b%Y")
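As a side note (my sketch, not required for the rest of the answer): if you would rather not slice strings, you can parse the full timestamp and derive a separate day column to group on; the saleDay column name here is purely illustrative.

# Parse the full "18feb2017 12:45:17"-style string so the time is kept,
# then floor to midnight to get a per-day key for grouping.
df['daySold'] = pd.to_datetime(df['daySold'], format='%d%b%Y %H:%M:%S')
df['saleDay'] = df['daySold'].dt.floor('D')  # illustrative helper column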
Then it is just a one-liner. With the groupby object you can pass a number of different aggregation calls:
import numpy as np

df.groupby(['daySold','productID']).agg({'quantitySold':[sum,np.std],'Price':[sum,np.std]})
                     quantitySold              Price
                              sum          std   sum  std
daySold    productID
2017-01-31 Sd23454            854   321.026479    78  0.0
2017-02-13 Rt4564            1544          NaN    45  NaN
2017-02-18 Fdgd4             5975  3800.698949   250  0.0
2017-03-18 Fdgd4             4487          NaN   125  NaN
2017-08-30 Sd23454           7895          NaN    39  NaN
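If, as in your df2, you want the statistics attached as new columns on the original rows rather than in a separate table, transform broadcasts each group's aggregate back to the original index. A minimal sketch (transform is standard pandas, but this step is my addition; the revenue column name is illustrative):

# revenue per row first, so the group sum gives total revenue
df['revenue'] = df['quantitySold'] * df['Price']

grp = df.groupby(['daySold', 'productID'])

# transform returns results aligned to df's index, so each aggregate
# can be assigned straight back as a column of the original frame
df['volSTD'] = grp['quantitySold'].transform('std')
df['totalVol'] = grp['quantitySold'].transform('sum')
df['totalRevenue'] = grp['revenue'].transform('sum')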
EDIT:
You can use the groupby object to apply all manner of functions, off-the-shelf ones and ones you define yourself. So you could do a dot product, requiring two columns / arrays of a dataframe, like so:
def dotter(df):
    return np.sum(df.quantitySold * df.Price)
    # or, if you want numpy's dot product--may be faster for large datasets:
    # return np.dot(df.quantitySold, df.Price)
Call it by using the apply method of the groupby object:
df.groupby(['daySold','productID']).apply(dotter)
daySold     productID
2017-01-31  Sd23454       33306
2017-02-13  Rt4564        69480
2017-02-18  Fdgd4        746875
2017-03-18  Fdgd4        560875
2017-08-30  Sd23454      307905
dtype: int64
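For completeness, a sketch (my addition; named aggregation needs pandas 0.25+, and the stats and revenue names are illustrative) that computes all three of the question's statistics in a single agg call. Precomputing the row revenue turns the dot product into a plain group sum, so no Python-level apply is needed:

# Precompute row revenue so the dot product becomes a group sum,
# then one agg call yields all three statistics per (daySold, productID).
stats = (
    df.assign(revenue=df['quantitySold'] * df['Price'])
      .groupby(['daySold', 'productID'])
      .agg(volSTD=('quantitySold', 'std'),
           totalVol=('quantitySold', 'sum'),
           totalRevenue=('revenue', 'sum'))
      .reset_index()
)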
Upvotes: 1