Vega

Reputation: 2929

How to filter multiple columns for value or wildcard?

I have a dataframe with model information (df_model, ~700 rows, ~50 types) and a dataframe with component information (df_components, ~400,000 rows).

df_model: a list of model types, each model type has several rows with specific components:

    type    mainclass   subclass    param1  param2  param3  param4
0   E2      C/C         C           IT      NaN     GR      NaN
1   E2      W           W           IT      NaN     NaN     AC-8
2   E2      P           13.3        NaN     NaN     NaN     NaN
3   P1      C/C         C           IT      NaN     TU      NaN
4   P1      GR          G           NAI     XSM     NaN     NaN

df_components: the list of components available plus the delivery time per component:

    mainclass   subclass    param1  param2  param3  param4  component_name  delivery_time
0   C/C         C           IT      X       GR      NaN     CM1             8
1   C/C         C           IT      NaN     GR      RAD     CM2             9
2   W           W           IT      NaN     NaN     AC-8    CM3             3
3   P           13.3        NaN     NaN     NaN     NaN     CM4             13
4   P           15.3        NaN     NaN     NaN     NaN     CM5             20
5   B           BR          NaN     KRL     NaN     NaN     CM6             5

I want to write a function that, for a given model type, returns the delivery time of each matching component from df_components plus the resulting minimal delivery time.

E.g.: for model type "E2" there are 2 matching components for "C/C C IT NaN GR NaN". One has a delivery time of 8, one of 9. Since 8 < 9, the component with 9 gets removed, because we only want the best delivery time per model component.

Problem: if a model component has NaN in a column, that NaN is a wildcard, i.e. whatever is in the corresponding component column should match (I could replace np.nan with any other wildcard marker if that helps).
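The wildcard rule can be expressed directly in pandas: a model value matches a component value when the two are equal or the model value is NaN. A minimal single-column sketch with hypothetical values (not the asker's data):

```python
import numpy as np
import pandas as pd

# hypothetical pattern/candidates: the model value is the pattern,
# the component values are the candidates to match against
model_value = np.nan                        # NaN acts as a wildcard
components = pd.Series(["X", np.nan, "GR"])

# a candidate matches when it equals the pattern OR the pattern is NaN;
# note that a plain == comparison is never True for NaN, hence the isna check
matches = (components == model_value) | pd.isna(model_value)
# with model_value = NaN every candidate matches;
# with model_value = "GR" only the last one would
```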

E.g. the first row of df_model:

    type    mainclass   subclass    param1  param2  param3  param4
0   E2      C/C         C           IT      NaN     GR      NaN

Column "param2" is NaN, so these 2 rows from df_components match because "param2" does not matter:

    mainclass   subclass    param1  param2  param3  param4  component_name  delivery_time
0   C/C         C           IT      X       GR      NaN     CM1             8
1   C/C         C           IT      NaN     GR      RAD     CM2             9

So for model type "E2" the end result should be a dict/JSON with:

{
    "leadtime_min": 13,
    "CM1": 8,
    "CM3": 3,
    "CM4": 13,
}

My approach for a function so far:

  1. select given model type
  2. iterate over each model component, searching for the matching components
  3. get the minimum delivery time per model component (if several matching components share the same minimal delivery time, they all need to be added to the result JSON)

My problem: the query in my function returns all rows instead of only the matching ones, and I need that solved.

My code so far:

import sys

import numpy as np
import pandas as pd

df_model = pd.DataFrame(
    {
        "type": ["E2", "E2", "E2", "P1", "P1"],
        "mainclass": ["C/C", "W", "P", "C/C", "GR"],
        "subclass": ["C", "W", "13.3", "C", "G"],
        "param1": ["IT", "IT", np.nan, "IT", "NAI"],
        "param2": [np.nan, np.nan, np.nan, np.nan, "XSM"],
        "param3": ["GR", np.nan, np.nan, "TU", np.nan],
        "param4": [np.nan, "AC-8", np.nan, np.nan, np.nan],
    }
)

df_components = pd.DataFrame(
    {
        "mainclass": [
            "C/C",
            "C/C",
            "W",
            "P",
            "P",
            "B",
        ],
        "subclass": ["C", "C", "W", "13.3", "15.3", "BR"],
        "param1": ["IT", "IT", "IT", np.nan, np.nan, np.nan],
        "param2": ["X", np.nan, np.nan, np.nan, np.nan, "KRL"],
        "param3": ["GR", "GR", np.nan, np.nan, np.nan, np.nan],
        "param4": [np.nan, "RAD", "AC-8", np.nan, np.nan, np.nan],
        "component_name": ["CM1", "CM2", "CM3", "CM4", "CM5", "CM6"],
        "delivery_time": [8, 9, 3, 13, 20, 5],
    }
)



def get_delivery_time_best(model_type: str, model: pd.DataFrame, components: pd.DataFrame):

    model = model.query("type == @model_type")
    fitting_components = {"delivery_time_min": sys.maxsize}
    delivery_times_all = []

    for model_component in model.itertuples():
        model_mainclass = model_component.mainclass
        model_subclass = model_component.subclass
        model_param1 = model_component.param1
        model_param2 = model_component.param2
        model_param3 = model_component.param3
        model_param4 = model_component.param4

        checked_components = components.query(
            """
            mainclass == @model_mainclass and \
            subclass == @model_subclass \
            and ( param1 == @model_param1 | @model_param1 != @model_param1 ) \
            and ( param2 == @model_param2 | @model_param2 != @model_param2 ) \
            and ( param3 == @model_param3 | @model_param3 != @model_param3 ) \
            and ( param4 == @model_param4 | @model_param4 != @model_param4 ) \
            """
        )

        component_descriptions = checked_components["component_name"].tolist()
        delivery_times = checked_components["delivery_time"].tolist()

        delivery_time_min = min(delivery_times)
        delivery_times_all.append(delivery_time_min)

        for component_description, delivery_time in zip(component_descriptions, delivery_times):
            if delivery_time == delivery_time_min:
                fitting_components[component_description] = delivery_time

    fitting_components["delivery_time_min"] = max(delivery_times_all)
    print(fitting_components)

get_delivery_time_best(model_type="E2", model=df_model, components=df_components)

But that returns all rows from df_components instead of only the 2 matching rows. What am I doing wrong?

Upvotes: 1

Views: 318

Answers (1)

Shubham Sharma

Reputation: 71687

Merge and boolean indexing

merged = df_components.merge(df_model, on=['mainclass', 'subclass'], suffixes=['', '_r'])

l = merged.filter(regex=r'^param\d+$')
r = merged[l.columns + '_r']

m = ((l.values == r.values) | r.isna().values).all(1)
merged[m].drop(columns=r.columns)

Explanations

Merge the dataframe df_components with df_model on the columns 'mainclass' and 'subclass', adding the suffix _r to the right dataframe's columns to uniquely identify them:

>>> merged

  mainclass subclass   param1 param2 param3 param4 param1_r  param2_r param3_r  param4_r
0        1a       1b  vendor1  spec1    id1    NaN  vendor1       NaN      id1       NaN
1        1a       1b  vendor1    NaN    id1  rout3  vendor1       NaN      id1       NaN
2        1a       1b  vendor1    NaN    id2    NaN  vendor1       NaN      id1       NaN
3        1a       1b      NaN    NaN    NaN    NaN  vendor1       NaN      id1       NaN
4        1a       1b  vendor2    NaN    id1    NaN  vendor1       NaN      id1       NaN
5        1a       1b  vendor2    NaN    NaN    NaN  vendor1       NaN      id1       NaN

Filter the param-like columns from the left and right dataframes:

>>> l

    param1 param2 param3 param4
0  vendor1  spec1    id1    NaN
1  vendor1    NaN    id1  rout3
2  vendor1    NaN    id2    NaN
3      NaN    NaN    NaN    NaN
4  vendor2    NaN    id1    NaN
5  vendor2    NaN    NaN    NaN


>>> r

  param1_r  param2_r param3_r  param4_r
0  vendor1       NaN      id1       NaN
1  vendor1       NaN      id1       NaN
2  vendor1       NaN      id1       NaN
3  vendor1       NaN      id1       NaN
4  vendor1       NaN      id1       NaN
5  vendor1       NaN      id1       NaN

Create a boolean mask by comparing the param-like columns of the left dataframe to the corresponding columns of the right dataframe. Similarly, create another boolean mask by checking the right dataframe for NaN values:

>>> l.values == r.values

array([[ True, False,  True, False],
       [ True, False,  True, False],
       [ True, False, False, False],
       [False, False, False, False],
       [False, False,  True, False],
       [False, False, False, False]])

>>> r.isna().values

array([[False,  True, False,  True],
       [False,  True, False,  True],
       [False,  True, False,  True],
       [False,  True, False,  True],
       [False,  True, False,  True],
       [False,  True, False,  True]])

Take the logical OR of the two masks and reduce the result with all along axis=1:

>>> m

array([ True,  True, False, False, False, False])

Now filter the rows of merged with the help of the mask m:

>>> merged[m].drop(columns=r.columns)

  mainclass subclass   param1 param2 param3 param4
0        1a       1b  vendor1  spec1    id1    NaN
1        1a       1b  vendor1    NaN    id1  rout3

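For completeness, the merge-and-mask idea above can be extended into the dict the question asks for. A sketch using the question's sample frames, assuming "leadtime_min" is the slowest of the per-model-row best times (which matches the expected output of 13):

```python
import numpy as np
import pandas as pd

df_model = pd.DataFrame({
    "type": ["E2", "E2", "E2", "P1", "P1"],
    "mainclass": ["C/C", "W", "P", "C/C", "GR"],
    "subclass": ["C", "W", "13.3", "C", "G"],
    "param1": ["IT", "IT", np.nan, "IT", "NAI"],
    "param2": [np.nan, np.nan, np.nan, np.nan, "XSM"],
    "param3": ["GR", np.nan, np.nan, "TU", np.nan],
    "param4": [np.nan, "AC-8", np.nan, np.nan, np.nan],
})
df_components = pd.DataFrame({
    "mainclass": ["C/C", "C/C", "W", "P", "P", "B"],
    "subclass": ["C", "C", "W", "13.3", "15.3", "BR"],
    "param1": ["IT", "IT", "IT", np.nan, np.nan, np.nan],
    "param2": ["X", np.nan, np.nan, np.nan, np.nan, "KRL"],
    "param3": ["GR", "GR", np.nan, np.nan, np.nan, np.nan],
    "param4": [np.nan, "RAD", "AC-8", np.nan, np.nan, np.nan],
    "component_name": ["CM1", "CM2", "CM3", "CM4", "CM5", "CM6"],
    "delivery_time": [8, 9, 3, 13, 20, 5],
})

def get_delivery_time_best(model_type, model, components):
    # keep one id per model row so matches can be grouped per row
    rows = model.query("type == @model_type").reset_index()
    merged = components.merge(rows, on=["mainclass", "subclass"],
                              suffixes=["", "_r"])
    l = merged.filter(regex=r"^param\d+$")   # component side
    r = merged[l.columns + "_r"]             # model side (NaN = wildcard)
    m = ((l.values == r.values) | r.isna().values).all(1)
    hits = merged[m]

    result, leadtime = {}, 0
    for _, grp in hits.groupby("index"):     # one group per model row
        best = grp["delivery_time"].min()
        leadtime = max(leadtime, best)
        # keep every component that ties for the best time
        for name in grp.loc[grp["delivery_time"] == best, "component_name"]:
            result[name] = best
    result["leadtime_min"] = leadtime
    return result
```

With the sample data, `get_delivery_time_best("E2", df_model, df_components)` yields the dict from the question; model rows with no matching component simply contribute nothing in this sketch.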
Upvotes: 3
