Vega

Reputation: 2929

How to filter multiple columns for value or wildcard?

I have a dataframe with model information (df_model, ~700 rows, ~50 types) and a dataframe with component information (df_components, ~400,000 rows).

df_model: a list of model types, each model type has several rows with specific components:

    type    mainclass   subclass    param1  param2  param3  param4
0   E2      C/C         C           IT      NaN     GR      NaN
1   E2      W           W           IT      NaN     NaN     AC-8
2   E2      P           13.3        NaN     NaN     NaN     NaN
3   P1      C/C         C           IT      NaN     TU      NaN
4   P1      GR          G           NAI     XSM     NaN     NaN

df_components: the list of components available plus the delivery time per component:

    mainclass   subclass    param1  param2  param3  param4  component_name  delivery_time
0   C/C         C           IT      X       GR      NaN     CM1             8
1   C/C         C           IT      NaN     GR      RAD     CM2             9
2   W           W           IT      NaN     NaN     AC-8    CM3             3
3   P           13.3        NaN     NaN     NaN     NaN     CM4             13
4   P           15.3        NaN     NaN     NaN     NaN     CM5             20
5   B           BR          NaN     KRL     NaN     NaN     CM6             5

I want to write a function that, for a given model type, returns the delivery time of each matching component from df_components plus the resulting minimal delivery time.

E.g.: for model type "E2" there are 2 matching components for "C/C C IT NaN GR NaN". One has a delivery time of 8, one of 9. Since 8 < 9, the component with 9 gets removed, because we only want the best delivery time per model component.

Problem: if a model component has NaN in a column, that NaN is a wildcard, i.e. whatever is in the corresponding component column should match (I could replace np.nan with any other wildcard marker if that helps).
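The wildcard rule can be expressed directly in pandas: a model value matches a component value when the two are equal or the model value is NaN. A minimal single-column sketch with hypothetical values (not the asker's data):

```python
import numpy as np
import pandas as pd

# hypothetical pattern/candidates: the model value is the pattern,
# the component values are the candidates to match against
model_value = np.nan                        # NaN acts as a wildcard
components = pd.Series(["X", np.nan, "GR"])

# a candidate matches when it equals the pattern OR the pattern is NaN;
# note that a plain == comparison is never True for NaN, hence the isna check
matches = (components == model_value) | pd.isna(model_value)
# with model_value = NaN every candidate matches;
# with model_value = "GR" only the last one would
```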

E.g. the first row of df_model:

    type    mainclass   subclass    param1  param2  param3  param4
0   E2      C/C         C           IT      NaN     GR      NaN

Column "param2" is NaN, so these 2 rows from df_components match because "param2" does not matter:

    mainclass   subclass    param1  param2  param3  param4  component_name  delivery_time
0   C/C         C           IT      X       GR      NaN     CM1             8
1   C/C         C           IT      NaN     GR      RAD     CM2             9

So for model type "E2" the end result should be a dict/JSON with:

{
    "leadtime_min": 13,
    "CM1": 8,
    "CM3": 3,
    "CM4": 13,
}

My approach for a function so far:

  1. select given model type
  2. iterate over each model component, searching for the matching components
  3. get the minimum delivery time per model component (if several matching components share the same minimal delivery time, they all need to be added to the result JSON)

My problem: the query in my function returns all rows instead of only the matching ones, and I need that solved.

My code so far:

import sys

import numpy as np
import pandas as pd

df_model = pd.DataFrame(
    {
        "type": ["E2", "E2", "E2", "P1", "P1"],
        "mainclass": ["C/C", "W", "P", "C/C", "GR"],
        "subclass": ["C", "W", "13.3", "C", "G"],
        "param1": ["IT", "IT", np.nan, "IT", "NAI"],
        "param2": [np.nan, np.nan, np.nan, np.nan, "XSM"],
        "param3": ["GR", np.nan, np.nan, "TU", np.nan],
        "param4": [np.nan, "AC-8", np.nan, np.nan, np.nan],
    }
)

df_components = pd.DataFrame(
    {
        "mainclass": [
            "C/C",
            "C/C",
            "W",
            "P",
            "P",
            "B",
        ],
        "subclass": ["C", "C", "W", "13.3", "15.3", "BR"],
        "param1": ["IT", "IT", "IT", np.nan, np.nan, np.nan],
        "param2": ["X", np.nan, np.nan, np.nan, np.nan, "KRL"],
        "param3": ["GR", "GR", np.nan, np.nan, np.nan, np.nan],
        "param4": [np.nan, "RAD", "AC-8", np.nan, np.nan, np.nan],
        "component_name": ["CM1", "CM2", "CM3", "CM4", "CM5", "CM6"],
        "delivery_time": [8, 9, 3, 13, 20, 5],
    }
)



def get_delivery_time_best(model_type: str, model: pd.DataFrame, components: pd.DataFrame):

    model = model.query("type == @model_type")
    fitting_components = {"delivery_time_min": sys.maxsize}
    delivery_times_all = []

    for model_component in model.itertuples():
        model_mainclass = model_component.mainclass
        model_subclass = model_component.subclass
        model_param1 = model_component.param1
        model_param2 = model_component.param2
        model_param3 = model_component.param3
        model_param4 = model_component.param4

        checked_components = components.query(
            """
            mainclass == @model_mainclass and \
            subclass == @model_subclass \
            and ( param1 == @model_param1 | @model_param1 != @model_param1 ) \
            and ( param2 == @model_param2 | @model_param2 != @model_param2 ) \
            and ( param3 == @model_param3 | @model_param3 != @model_param3 ) \
            and ( param4 == @model_param4 | @model_param4 != @model_param4 ) \
            """
        )

        component_descriptions = checked_components["component_name"].tolist()
        delivery_times = checked_components["delivery_time"].tolist()

        delivery_time_min = min(delivery_times)
        delivery_times_all.append(delivery_time_min)

        for component_description, delivery_time in zip(component_descriptions, delivery_times):
            if delivery_time == delivery_time_min:
                fitting_components[component_description] = delivery_time

    fitting_components["delivery_time_min"] = max(delivery_times_all)
    print(fitting_components)

get_delivery_time_best(model_type="E2", model=df_model, components=df_components)

But that returns all rows from df_components instead of only the 2 matching rows. What am I doing wrong?

Upvotes: 1

Views: 318

Answers (1)

Shubham Sharma

Reputation: 71687

Merge and boolean indexing

merged = df_components.merge(df_model, on=['mainclass', 'subclass'], suffixes=['', '_r'])

l = merged.filter(regex=r'^param\d+$')
r = merged[l.columns + '_r']

m = ((l.values == r.values) | r.isna().values).all(1)
merged[m].drop(columns=r.columns)

Explanations

Merge the dataframe df_components with df_model on the columns 'mainclass' and 'subclass', adding the suffix _r to the right dataframe's columns to uniquely identify them:

>>> merged

  mainclass subclass   param1 param2 param3 param4 param1_r  param2_r param3_r  param4_r
0        1a       1b  vendor1  spec1    id1    NaN  vendor1       NaN      id1       NaN
1        1a       1b  vendor1    NaN    id1  rout3  vendor1       NaN      id1       NaN
2        1a       1b  vendor1    NaN    id2    NaN  vendor1       NaN      id1       NaN
3        1a       1b      NaN    NaN    NaN    NaN  vendor1       NaN      id1       NaN
4        1a       1b  vendor2    NaN    id1    NaN  vendor1       NaN      id1       NaN
5        1a       1b  vendor2    NaN    NaN    NaN  vendor1       NaN      id1       NaN

Filter the param-like columns from the left and right dataframes:

>>> l

    param1 param2 param3 param4
0  vendor1  spec1    id1    NaN
1  vendor1    NaN    id1  rout3
2  vendor1    NaN    id2    NaN
3      NaN    NaN    NaN    NaN
4  vendor2    NaN    id1    NaN
5  vendor2    NaN    NaN    NaN


>>> r

  param1_r  param2_r param3_r  param4_r
0  vendor1       NaN      id1       NaN
1  vendor1       NaN      id1       NaN
2  vendor1       NaN      id1       NaN
3  vendor1       NaN      id1       NaN
4  vendor1       NaN      id1       NaN
5  vendor1       NaN      id1       NaN

Create a boolean mask by comparing the param-like columns of the left dataframe to the corresponding columns of the right dataframe. Similarly, create another boolean mask by checking the right dataframe for NaN values:

>>> l.values == r.values

array([[ True, False,  True, False],
       [ True, False,  True, False],
       [ True, False, False, False],
       [False, False, False, False],
       [False, False,  True, False],
       [False, False, False, False]])

>>> r.isna().values

array([[False,  True, False,  True],
       [False,  True, False,  True],
       [False,  True, False,  True],
       [False,  True, False,  True],
       [False,  True, False,  True],
       [False,  True, False,  True]])

Take the logical OR of the two masks and reduce the result with all along axis=1:

>>> m

array([ True,  True, False, False, False, False])

Now filter the rows of merged with the help of the mask m:

>>> merged[m].drop(columns=r.columns)

  mainclass subclass   param1 param2 param3 param4
0        1a       1b  vendor1  spec1    id1    NaN
1        1a       1b  vendor1    NaN    id1  rout3

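For completeness, the merge-and-mask idea above can be extended into the dict the question asks for. A sketch using the question's sample frames, assuming "leadtime_min" is the slowest of the per-model-row best times (which matches the expected output of 13):

```python
import numpy as np
import pandas as pd

df_model = pd.DataFrame({
    "type": ["E2", "E2", "E2", "P1", "P1"],
    "mainclass": ["C/C", "W", "P", "C/C", "GR"],
    "subclass": ["C", "W", "13.3", "C", "G"],
    "param1": ["IT", "IT", np.nan, "IT", "NAI"],
    "param2": [np.nan, np.nan, np.nan, np.nan, "XSM"],
    "param3": ["GR", np.nan, np.nan, "TU", np.nan],
    "param4": [np.nan, "AC-8", np.nan, np.nan, np.nan],
})
df_components = pd.DataFrame({
    "mainclass": ["C/C", "C/C", "W", "P", "P", "B"],
    "subclass": ["C", "C", "W", "13.3", "15.3", "BR"],
    "param1": ["IT", "IT", "IT", np.nan, np.nan, np.nan],
    "param2": ["X", np.nan, np.nan, np.nan, np.nan, "KRL"],
    "param3": ["GR", "GR", np.nan, np.nan, np.nan, np.nan],
    "param4": [np.nan, "RAD", "AC-8", np.nan, np.nan, np.nan],
    "component_name": ["CM1", "CM2", "CM3", "CM4", "CM5", "CM6"],
    "delivery_time": [8, 9, 3, 13, 20, 5],
})

def get_delivery_time_best(model_type, model, components):
    # keep one id per model row so matches can be grouped per row
    rows = model.query("type == @model_type").reset_index()
    merged = components.merge(rows, on=["mainclass", "subclass"],
                              suffixes=["", "_r"])
    l = merged.filter(regex=r"^param\d+$")   # component side
    r = merged[l.columns + "_r"]             # model side (NaN = wildcard)
    m = ((l.values == r.values) | r.isna().values).all(1)
    hits = merged[m]

    result, leadtime = {}, 0
    for _, grp in hits.groupby("index"):     # one group per model row
        best = grp["delivery_time"].min()
        leadtime = max(leadtime, best)
        # keep every component that ties for the best time
        for name in grp.loc[grp["delivery_time"] == best, "component_name"]:
            result[name] = best
    result["leadtime_min"] = leadtime
    return result
```

With the sample data, `get_delivery_time_best("E2", df_model, df_components)` yields the dict from the question; model rows with no matching component simply contribute nothing in this sketch.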
Upvotes: 3
