Reputation: 2929
I have a dataframe with model information (df_model, ~700 rows, ~50 types) and a dataframe with component information (df_components, ~400.000 rows).
df_model: a list of model types, each model type has several rows with specific components:
type mainclass subclass param1 param2 param3 param4
0 E2 C/C C IT NaN GR NaN
1 E2 W W IT NaN NaN AC-8
2 E2 P 13.3 NaN NaN NaN NaN
3 P1 C/C C IT NaN TU NaN
4 P1 GR G NAI XSM NaN NaN
df_component: the list of components available + delivery time per component:
mainclass subclass param1 param2 param3 param4 component_name delivery_time
0 C/C C IT X GR NaN CM1 8
1 C/C C IT NaN GR RAD CM2 9
2 W W IT NaN NaN AC-8 CM3 3
3 P 13.3 NaN NaN NaN NaN CM4 13
4 P 15.3 NaN NaN NaN NaN CM5 20
5 B BR NaN KRL NaN NaN CM6 5
I want to write a function that returns - for a specific model type - the delivery times for each matching component from the model_components + the minimal delivery time.
E.g.: for model type "E2" there are 2 matching components for "C/C C IT NaN GR NaN". One has a delivery time of 8, one of 9. 8 < 9 -> 9 gets removed because we want the best delivery time per model component.
Problem: If a model component has "NaN" in a column it's a wildcard = whatever is in the corresponding component column should match (I could change np.nan for wildcards to anything else if that helps).
E.g. the first row of df_model:
type mainclass subclass param1 param2 param3 param4
0 E2 C/C C IT NaN GR NaN
Column "param2" = "NaN" so from df_component these 2 rows match because column "param2" does not matter:
mainclass subclass param1 param2 param3 param4 component_name delivery_time
0 C/C C IT X GR NaN CM1 8
1 C/C C IT NaN GR RAD CM2 9
So for model type "E2" the end result should be a dict/JSON with:
{
"leadtime_min": 13,
"CM1": 8,
"CM3": 3,
"CM4": 13,
}
My approach for a function so far:
My query code in my function returns all rows and not only the matching ones and I need that solved.
My code so far:
df_model = pd.DataFrame(
{
"type": ["E2", "E2", "E2", "P1", "P1"],
"mainclass": ["C/C", "W", "P", "C/C", "GR"],
"subclass": ["C", "W", "13.3", "C", "G"],
"param1": ["IT", "IT", np.nan, "IT", "NAI"],
"param2": [np.nan, np.nan, np.nan, np.nan, "XSM"],
"param3": ["GR", np.nan, np.nan, "TU", np.nan],
"param4": [np.nan, "AC-8", np.nan, np.nan, np.nan],
}
)
df_components = pd.DataFrame(
{
"mainclass": [
"C/C",
"C/C",
"W",
"P",
"P",
"B",
],
"subclass": ["C", "C", "W", "13.3", "15.3", "BR"],
"param1": ["IT", "IT", "IT", np.nan, np.nan, np.nan],
"param2": ["X", np.nan, np.nan, np.nan, np.nan, "KRL"],
"param3": ["GR", "GR", np.nan, np.nan, np.nan, np.nan],
"param4": [np.nan, "RAD", "AC-8", np.nan, np.nan, np.nan],
"component_name": ["CM1", "CM2", "CM3", "CM4", "CM5", "CM6"],
"delivery_time": [8, 9, 3, 13, 20, 5],
}
)
def get_delivery_time_best(model_type: str, model: pd.DataFrame, component: pd.DataFrame)
model = model.query("type == @model_type")
fitting_components = {"delivery_time_min": sys.maxsize}
delivery_times_all = []
for model_component in df_model.itertuples():
model_mainclass = model_component.mainclass
model_subclass = model_component.subclass
model_param1 = model_component.param1
model_param2 = model_component.param2
model_param3 = model_component.param3
model_param4 = model_component.param4
checked_components = components.query(
"""
mainclass == @model_mainclass and \
subclass == @model_subclass \
and ( param1 == @model_param1 | @model_param1 != @model_param1 ) \
and ( param2 == @model_param2 | @model_param2 != @model_param2 ) \
and ( param3 == @model_param3 | @model_param3 != @model_param3 ) \
and ( param4 == @model_param4 | @model_param4 != @model_param4 ) \
"""
)
component_descriptions = checked_components"component_name" ].tolist()
delivery_times = checked_components["delivery_time"].tolist()
delivery_time_min = min(delivery_times )
delivery_times_all.append(delivery_time_min )
for component_description, delivery_time in zip(component_descriptions, delivery_times ):
if delivery_time == delivery_time_min :
fitting_components[component_description] = delivery_time
fitting_components["delivery_time_min "] = max(delivery_times_all )
print(fitting_components)
get_delivery_time_best(model_type="E2", model=df_model, components=df_components)
But that returns all rows from df_components and not the 2 matching rows?
Upvotes: 1
Views: 318
Reputation: 71687
Merge
and boolean indexing
merged = df_components.merge(df_model, on=['mainclass', 'subclass'], suffixes=['', '_r'])
l = merged.filter(regex=r'^param\d+$')
r = merged[l.columns + '_r']
m = ((l.values == r.values) | r.isna().values).all(1)
merged[m].drop(r.columns, 1)
Merge the dataframe df_components
with df_model
on columns 'mainclass', 'subclass'
and add a suffix _r
to the right df to uniquely identify the columns from right dataframe
>>> merged
mainclass subclass param1 param2 param3 param4 param1_r param2_r param3_r param4_r
0 1a 1b vendor1 spec1 id1 NaN vendor1 NaN id1 NaN
1 1a 1b vendor1 NaN id1 rout3 vendor1 NaN id1 NaN
2 1a 1b vendor1 NaN id2 NaN vendor1 NaN id1 NaN
3 1a 1b NaN NaN NaN NaN vendor1 NaN id1 NaN
4 1a 1b vendor2 NaN id1 NaN vendor1 NaN id1 NaN
5 1a 1b vendor2 NaN NaN NaN vendor1 NaN id1 NaN
Filter the param like columns from the left and right dataframe
>>> l
param1 param2 param3 param4
0 vendor1 spec1 id1 NaN
1 vendor1 NaN id1 rout3
2 vendor1 NaN id2 NaN
3 NaN NaN NaN NaN
4 vendor2 NaN id1 NaN
5 vendor2 NaN NaN NaN
>>> r
param1_r param2_r param3_r param4_r
0 vendor1 NaN id1 NaN
1 vendor1 NaN id1 NaN
2 vendor1 NaN id1 NaN
3 vendor1 NaN id1 NaN
4 vendor1 NaN id1 NaN
5 vendor1 NaN id1 NaN
Create a boolean mask by comparing the param like columns in left dataframe to the corresponding param like columns in right dataframe, Similarly create another boolean mask by checking the right dataframe for NaN
values
>>> l.values == r.values
array([[ True, False, True, False],
[ True, False, True, False],
[ True, False, False, False],
[False, False, False, False],
[False, False, True, False],
[False, False, False, False]])
>>> r.isna().values
array([[False, True, False, True],
[False, True, False, True],
[False, True, False, True],
[False, True, False, True],
[False, True, False, True],
[False, True, False, True]])
Take the logical or of the above masks and reduce the resulting mask with all
along axis=1
>>> m
array([ True, True, False, False, False, False])
Now filter the rows in the df_components
with the help of mask m
>>> merged[m].drop(r.columns, 1)
mainclass subclass param1 param2 param3 param4
0 1a 1b vendor1 spec1 id1 NaN
1 1a 1b vendor1 NaN id1 rout3
Upvotes: 3