Aashish Gahlawat
Aashish Gahlawat

Reputation: 449

filter pandas rows by other dataframe columns

I have 3 dataframes already sorted with date and p_id and with no null values as:

First DataFrame

df1 = pd.DataFrame([['2018-07-05',8.0,1],
                    ['2018-07-15',1.0,1],
                    ['2018-08-05',2.0,1],
                    ['2018-08-05',2.0,2]],
      columns=["purchase_date", "qty", "p_id"])

Second DataFrame

df2 = pd.DataFrame([['2018-07-15',2.0,1],
                    ['2018-08-04',7.0,1],
                    ['2018-08-15',1.0,2]], 
      columns=["sell_date", "qty", "p_id"])

Third DataFrame

df3 = pd.DataFrame([['2018-07-25',1.0,1],
                    ['2018-08-15',1.0,1]],
      columns=["expired_date", "qty", "p_id"])

dataframe looks like:

1st: (Holds Purchase details)

    purchase_date   qty     p_id
0   2018-07-05      8.0     1
1   2018-07-15      1.0     1
2   2018-08-05      2.0     1
3   2018-08-05      2.0     2

2nd: (Holds Sales Details)

    sell_date   qty    p_id
0   2018-07-15  2.0    1
1   2018-08-04  7.0    1
2   2018-08-15  1.0    2

3rd: (Holds Expiry Details)

    expired_date    qty   p_id
0   2018-07-25      1.0   1
1   2018-08-15      1.0   1

Now What I want to do is find when the product that has expired was bought
following FIFO (product first purchased will expire first)



Explanation: Consider product with id 1

By date 2018-07-15

We had 8+1 purchased quantity and -2 sold quantity i.e. total of 8+1-2 quantity in stock , -ve sign signify quantity deduction

By date 2018-07-25

1 quantity expired so first entry for our new when_product_expired dataframe will be:

purchase_date     expired_date    p_id
2018-07-05        2018-07-25      1


And then for next expiry entry

By date 2018-08-04

7 quantity were sold out so current quantity will be 8+1-2-7 = 0

By date 2018-08-05

2 quantity were bought so current quantity is 0+2

By date 2018-08-15

1 quantity expired

So a new and final entry will be:

purchase_date     expired_date    p_id
2018-07-05        2018-07-25      1
2018-08-05        2018-08-15      1

This time the product expired was one that was purchased on 2018-07-25

Actually I have date time, so purchase and sell time will never be equal (you may assume), also before selling and expire, there will always be some quantity of product in stock, i.e. data is consistent
And Thank you in advance :-)

Updated

What by now I am thinking is rename all date fields to same field name and append purchase, sell, expired dataframe with negative sign, but that won't help me

df2.qty = df2.qty*-1
df3.qty=df3.qty*-1
new = pd.concat([df1,df2, df3],sort=False)
      .sort_values(by=["purchase_date"],ascending=True)
      .reset_index(drop=True)

Upvotes: 0

Views: 181

Answers (1)

maow
maow

Reputation: 2887

What you essentially want is this FIFO list of items in stock. In my experience pandas is not the right tool to relate different rows to each other. The workflow should be split-apply-combine. If you split it and don't really see a way how to puzzle it back together, it may be a ill-formulated problem. You can still get a lot done with groupby, but this is something I would not try to solve with some clever trick in pandas. Even if you make it work, it will be hell to maintain.

I don't know how performance critical your problem is (i.e. how large are your Dataframes). If its just a few 10000 entries you can just explicitly loop over the pandas rows (warning: this is slow) and build the fifo list by hand.

I hacked together some code for this. The DateFrame you proposed is in there. I loop over all rows and do bookkeeping how many items are in stock. This is done in a queue q which contains an element for each item and the element convienently is the purchase_date.

import queue

import pandas as pd

from pandas import Series, DataFrame

# modified (see text)
df1 = pd.DataFrame([['2018-07-05',8.0,1],
                    ['2018-07-15',3.0,1],
                    ['2018-08-05',2.0,1],
                    ['2018-08-05',2.0,2]],
      columns=["purchase_date", "qty", "p_id"])

df2 = pd.DataFrame([['2018-07-15',2.0,1],
                    ['2018-08-04',7.0,1],
                    ['2018-08-15',1.0,2]], 
      columns=["sell_date", "qty", "p_id"])

df3 = pd.DataFrame([['2018-07-25',1.0,1],
                    ['2018-08-15',1.0,1]],
      columns=["expired_date", "qty", "p_id"])


df1 = df1.rename(columns={'purchase_date':'date'})

df2 = df2.rename(columns={'sell_date':'date'})

df3 = df3.rename(columns={'expired_date' : 'date'})

df3['qty'] *= -1

df2['qty'] *= -1

df = pd.concat([df1,df2])\
      .sort_values(by=["date"],ascending=True)\
      .reset_index(drop=True)

# Necessary to distinguish between sold and expried items while looping
df['expired'] = False
df3['expired'] = True

df = pd.concat([df,df3])\
      .sort_values(by=["date"],ascending=True)\
      .reset_index(drop=True)

#date  qty  p_id  expired
#7-05  8.0     1    False
#7-15  1.0     1    False
#7-15 -2.0     1    False
#7-25 -1.0     1     True
#8-04 -7.0     1    False
#8-05  2.0     1    False
#8-05  2.0     2    False
#8-15 -1.0     2    False
#8-15 -1.0     1     True

# Iteratively build up when_product_expired
when_product_expired = []

# p_id hardcoded here
p_id = 1

# q contains purchase dates for all individual items 'currently' in stock
q = queue.Queue()

for index, row in df[df['p_id'] == p_id].iterrows():
    # if items are bought, put as many as 'qty' into q
    if row['qty'] > 0:
        for tmp in range(int(round(row['qty']))):
            date = row['date']
            q.put(date)
    # if items are sold or expired, remove as many from q. 
    # if expired additionaly save purchase and expiration date into when_product_expired
    elif row['qty'] < 0:
        for tmp in range(int(round(-row['qty']))):
            purchase_date = q.get()
            if row['expired']:
                print 'item p_id 1 was bought on', purchase_date
                when_product_expired.append([purchase_date, row['date'], p_id])

when_product_expired = DataFrame(when_product_expired, columns=['purchase_date', 'expired_date', 'p_id'])

A few remarks:

  • I relied on your guarentee that

    before selling and expire, there will always be some quantity of product in stock

    This is not given for your example DataFrames. Before 2018-07-25 there are 9 items with p_id 1 bought and 9 sold. There is nothing in stock that could expire. I modified df1 so that 11 pieces are bought.

  • If this assumption is violated Queue will try to get an item that is not there. On my machine that leads to an endless loop. You might want to catch the exception.
  • The queue is not in the least efficiently implemented. If many items are in stock, there will be a lot of data doubling.
  • You can generalize that to more p_id's by either putting everything into a function and .groupby('p_id').apply(function) or loop over df['p_id'].unique()

So while this is not scalable solution, I hope it helps you a bit. Good look

Upvotes: 1

Related Questions