Cypress

Reputation: 357

fuzzy join with multiple conditions

I know there are questions with similar titles out there, but none of them really answers my question. I have a data frame as below. The "index" column is actually a timestamp. Column A is how many tonnes of material have been dumped into a crusher at each timestamp. Column B is the crushing rate at each timestamp. What I want to know is when a load of material (column A) will be fully crushed, given the crushing rate (column B).

There are three possible scenarios.

  1. The first load is crushed exactly when the second load is added.
  2. The first load is crushed before the second load is added.
  3. The first load is not yet crushed when the second load is added.

I have tried calculating the cumulative values of columns A and B and using merge_asof to perform the fuzzy join, but it doesn't work as expected, because in reality excess crushing capacity is not stored: only the crushing rate after material has been loaded should be counted.

import pandas as pd

A = {'index': range(1, 11), 'A': [300, 0, 0, 400, 0, 0, 0, 0, 150, 0]}
B = {'index': range(1, 11), 'B': [102, 103, 94, 120, 145, 114, 126, 117, 107, 100]}
A = pd.DataFrame(data=A)
B = pd.DataFrame(data=B)
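
Here is a minimal version of what I tried, continuing from the frames above (the column names A_cum and B_cum are just for illustration):

loads = A[A['A'] > 0].copy()
loads['A_cum'] = loads['A'].cumsum()   # total material loaded so far
caps = B.copy()
caps['B_cum'] = caps['B'].cumsum()     # total crushing capacity so far

# Match each cumulative load to the first timestamp whose cumulative
# capacity reaches it:
naive = pd.merge_asof(loads, caps, left_on='A_cum', right_on='B_cum',
                      direction='forward', suffixes=('_A', '_B'))
print(naive[['index_A', 'A', 'index_B', 'B_cum']])
# With the data above this matches the 400 t load to index 7 instead of 8,
# because capacity accumulated during idle steps is (wrongly) carried over.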

Here is the expected outcome:

IndexA    A  IndexB  B_accumulate
     1  300       4           419
     4  400       8           502
     9  150      10           207

B_accumulate is the running total of the crushing rate (B); it is reset to 0 once a load of material is fully crushed (i.e. when B_accumulate >= A).
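
To make the rule concrete, here is a plain loop that produces the table above (variable names are just for illustration; it assumes leftover capacity in the step where a load finishes is discarded, as described):

rows, queue, active, b_accum = [], [], None, 0
for ia, a, ib, b in zip(A['index'], A['A'], B['index'], B['B']):
    if a > 0:
        queue.append((ia, a))        # new load joins the queue
    if active is None and queue:
        active = queue.pop(0)        # crusher is free: start on the next load
        b_accum = 0
    if active is not None:
        b_accum += b                 # capacity only counts while a load is pending
        if b_accum >= active[1]:
            rows.append((active[0], active[1], ib, b_accum))
            active = None            # excess capacity in this step is discarded

print(pd.DataFrame(rows, columns=['IndexA', 'A', 'IndexB', 'B_accumulate']))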

Upvotes: 3

Views: 224

Answers (4)

kantal

Reputation: 2407

I've simplified the structures, using Series instead of DataFrames, and the indexes start at zero; cumsum() and searchsorted() do the work.
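
For example, with a toy series, cumsum() builds the running capacity and searchsorted() finds the first position where it reaches the load:

rate = pd.Series([102, 103, 94, 120])
total = rate.cumsum()      # 102, 205, 299, 419
total.searchsorted(300)    # -> 3, the first index where the total reaches 300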

Load = pd.Series([300,0,0,400,50,0,0,0,150,0])  # aka 'A'
Rate = pd.Series([102,103,94,120,145,114,126,117,107,100])  # aka 'B'

# Storage for the result:
H = []    # [ (indexLoad, Load, indexRate, excess), ... ]

# Find the first non-zero load and where it finishes:
load1_idx = len(Load)

for lix in range(len(Load)):
    a = Load[lix]
    if a != 0:
        csumser = Rate.cumsum()
        rix = csumser.searchsorted(a)  # first index where the running rate reaches the load
        excess = csumser[rix] - a      # capacity left over at that index
        H.append((lix, a, rix, excess))
        load1_idx = lix
        break

# Process the remaining loads:
for lix in range(load1_idx + 1, len(Load)):

    a = Load[lix]
    if a == 0:
        continue

    # Restart the running total at the index where the previous load finished
    last_rix = H[-1][-2]
    csumser[last_rix:] = Rate[last_rix:]
    if lix == last_rix:
        # The new load arrives in the very step the previous one finished,
        # so only that step's leftover capacity is available
        csumser[lix] = H[-1][-1]  # excess

    csumser[last_rix:] = csumser[last_rix:].cumsum()

    rix = csumser[last_rix:].searchsorted(a)
    rix += last_rix
    excess = csumser[rix] - a
    H.append((lix, a, rix, excess))

df= pd.DataFrame(H, columns=["indexLoad","Load","indexRate","rate_excess"])
print(df)

   indexLoad  Load  indexRate  rate_excess
0          0   300          3          119
1          3   400          6          104
2          4    50          6           76
3          8   150          7           93

Upvotes: 1

user10325516

Reputation:

A possible approach. The problem is split into two parts: get the actual amount of material present (which cannot be negative), and analyse the loads (groups of rows during which there is material left to crush).

import numpy as np
import pandas as pd

def get_load(df):
    """ get loaded material minus crushed material """
    current_load = (df['A'] - df['B']).values
    if current_load[0] < 0:
        current_load[0] = 0
    for idx in range(1, len(current_load)):
        correct_value = current_load[idx - 1] + current_load[idx]
        if correct_value < 0:
            current_load[idx] = 0
        else:
            current_load[idx] = correct_value
    return current_load

def get_work_load_chunk_stat(df):
    """ get stats for the chunks in which material is actually being crushed """
    if df['load'].sum() == 0:
        return
    ans = pd.DataFrame(
        {'indexA': [df.iloc[0, :]['indexA']],
         'total_load': [df['A'].sum()],
         'loads_qty': [df[df['A'] > 0]['A'].count()],
         'indexB': [df.iloc[-1, :]['indexB']],
         'total_work': [df['B'].sum()]})
    return ans

if __name__ == '__main__':

    A = {'indexA': range(22),
         'A': [0, 300, 0, 0, 0, 0, 400, 0, 100, 0, 0, 0, 300, 0, 0, 0, 0, 400, 0, 100, 0, 0]}
    B = {'indexB': range(22),
         'B': [99, 102, 103, 94, 120, 145, 114, 126, 117, 107, 87, 99, 102, 103, 94, 120, 145, 114, 126, 117, 107, 87]}

    data = pd.concat([pd.DataFrame(data=A), pd.DataFrame(data=B)], axis=1)
    data['load'] = get_load(data)
    data['load_check'] = np.where(data['load'] == 0, 1, 0)
    data['load_check'] = data['load_check'].shift(fill_value=0).cumsum()
    # print(data)

    result = (
        data
        .groupby('load_check')
        .apply(get_work_load_chunk_stat)
        .reset_index(drop=True))
    print(result)

Output:

   indexA  total_load  loads_qty  indexB  total_work
0       1         300          1       4         419
1       6         500          2      10         551
2      12         300          1      15         419
3      17         500          2      21         551

Upvotes: 1

Rean

Reputation: 56

Create a combined DataFrame from A and B:

import numpy as np
import pandas as pd

A = {'index': range(1, 11), 'A': [300, 0, 400, 0, 0, 0, 0, 0, 100, 0]}
B = {'index': range(1, 11), 'B': [102, 103, 94, 120, 145, 114, 126, 117, 107, 87]}
df_A = pd.DataFrame(data=A)
df_B = pd.DataFrame(data=B)
df_com = pd.concat([df_A, df_B], axis=1).drop('index', axis=1)

Create the index columns:

indexA = list(df_com.A[df_com.A.ne(0)].index + 1)
indexB = np.array(indexA) - 2
indexB = np.append(indexB[1:],(len(df_com)-1))

Replace the zeros in column A with the preceding load (forward fill):

df_com['A'] = df_com.A.replace(0,method='pad')

Group by the filled load values and add the index columns (note: this assumes each load amount is distinct, since equal loads would fall into the same group):

df_new =df_com.groupby("A",sort=False).apply(lambda x:x.B.shift(1).sum()).reset_index()
df_new['indexA'] = indexA
df_new['indexB'] = indexB
df_new

Upvotes: 1

WolfgangK

Reputation: 993

Here is a very verbose solution which I hope is generalisable to your full data. I am sure you can simplify it.

C = A.join(B.set_index('index'), on='index')

C['A_filled'] = C['A'].replace(to_replace=0, method='ffill')  # carry each load forward
C['cumul_load'] = C['A'].cumsum()
C['load_number'] = C.groupby('cumul_load').ngroup() + 1       # label the rows belonging to each load
C['B_accum'] = C.groupby('load_number')['B'].cumsum()         # crushing total within each load
C['A_fully_crushed'] = C['B_accum'] > C['A_filled']
C['first_index_fully_crushed'] = C.groupby('load_number')['A_fully_crushed'].cumsum() == 1

indexA_ = C['index'][C['A'] > 0].tolist()
A_ = C['A'][C['A'] > 0].tolist()
indexB_ = C['index'][C['first_index_fully_crushed'] == True].tolist()
B_accumulate_ = C['B_accum'][C['first_index_fully_crushed'] == True].tolist()
result = pd.DataFrame({'indexA': indexA_, 'A': A_, 'indexB': indexB_, 'B_accumulate': B_accumulate_})

This yields

   indexA    A  indexB  B_accumulate
0       1  300       4           419
1       6  400       9           464

Upvotes: 1
