Reputation: 3
I have an input dataset that contains columns which are only known in the future (e.g. arrival time), and I use these for training. The input data has the following columns: { "toAddressPostCode", "toAddressCountryCode", "unloadInDatetime", "unloadOutDatetime", "loadingMeters", "payableWeight", "name" }
I use the following python code to convert these columns right after import:
import pandas as pd

def transform_data(df):
    if 'unloadInDatetime' in df.columns and 'unloadOutDatetime' in df.columns:
        # Parse the timestamps; invalid values become NaT
        datetime_format = '%Y-%m-%d %H:%M:%S.%f'  # format includes microseconds
        df['InDateTime'] = pd.to_datetime(df['unloadInDatetime'], format=datetime_format, errors='coerce')
        df['OutDateTime'] = pd.to_datetime(df['unloadOutDatetime'], format=datetime_format, errors='coerce')
        # Drop rows that failed to parse; otherwise the integer casts below raise on NaN
        df = df.dropna(subset=['InDateTime', 'OutDateTime'])
        # Extract date components and the hour as an integer
        df['InDate'] = df['InDateTime'].dt.date
        df['InTime'] = df['InDateTime'].dt.hour.astype(int)
        df['OutDate'] = df['OutDateTime'].dt.date
        df['OutTime'] = df['OutDateTime'].dt.time
        # Other transformations
        df['toAddressPostCode'] = df['toAddressPostCode'].str[:5]
        df['UnloadMonth'] = df['InDateTime'].dt.month
        df['UnloadWeekday'] = df['InDateTime'].dt.weekday + 1  # Monday = 1
        # Unloading duration in minutes
        df['UnloadingTime'] = ((df['OutDateTime'] - df['InDateTime']).dt.total_seconds() / 60).astype(int)
        # Keep only plausible durations (between 5 and 300 minutes)
        df = df[(df['UnloadingTime'] > 5) & (df['UnloadingTime'] < 300)]
        df.reset_index(drop=True, inplace=True)
        # Select the columns used for training
        df = df[['UnloadingTime', 'toAddressCountryCode', 'toAddressPostCode',
                 'UnloadWeekday', 'UnloadMonth', 'loadingMeters', 'payableWeight',
                 'InTime', 'name']]
    else:
        print("Required columns not found in the DataFrame")
    return df

def azureml_main(dataframe1=None, dataframe2=None):
    # Perform the data transformation on the first input
    if dataframe1 is not None:
        df_transformed = transform_data(dataframe1)
    else:
        df_transformed = pd.DataFrame()
    return df_transformed, None
From this I would like to use these columns as the input data for my endpoint: { "toAddressCountryCode", "toAddressPostCode", "UnloadWeekday", "UnloadMonth", "loadingMeters", "payableWeight", "InTime", "name" }
Is there any way to achieve this?
First of all, I ran the flow and deployed it as an endpoint. The flow looks as follows: [screenshot of the flow]
I have looked through the entire flow to check whether the columns I wanted to use were present, and they were. In the endpoint testing window I tried to input the desired columns, but then an error pops up saying "input data are inconsistent with schema".
Upvotes: 0
Views: 161
Reputation: 8160
Whenever you score a model endpoint, the input must match the schema the model was trained on.
If it does not match, transform the input data so that it conforms to the endpoint's input schema.
Alternatively, modify the scoring script so that it transforms the incoming data into the shape the model expects before prediction, and create the endpoint from that.
You can make these changes when deploying a registered model as an online endpoint with a custom scoring script.
Go to the Models tab and open the registered model. You will see a Deploy option; click it and you will be prompted for several inputs. There, you upload the modified scoring script.
Below is a sample scoring script (score.py) you can refer to.
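A minimal sketch of such a score.py, assuming a scikit-learn model serialized as model.pkl inside the registered model folder and a JSON payload of the form {"data": [...]}; the model file name, payload shape, and the transform_input helper are assumptions you should adapt to your own flow:

```python
# score.py -- minimal sketch of a custom scoring script.
# model.pkl, the payload shape, and transform_input are assumptions.
import json
import os

import pandas as pd

model = None

# Columns the model was trained on, in training order
EXPECTED_COLUMNS = [
    "toAddressCountryCode", "toAddressPostCode", "UnloadWeekday",
    "UnloadMonth", "loadingMeters", "payableWeight", "InTime", "name",
]

def transform_input(df):
    """Coerce raw endpoint input into the schema the model was trained on."""
    # Truncate post codes exactly as in the training transformation
    df["toAddressPostCode"] = df["toAddressPostCode"].astype(str).str[:5]
    # Keep only the expected columns, in training order
    return df[EXPECTED_COLUMNS]

def init():
    global model
    import joblib  # imported lazily so the module loads without it
    # AZUREML_MODEL_DIR is set by the Azure ML inference server
    model_path = os.path.join(os.environ["AZUREML_MODEL_DIR"], "model.pkl")
    model = joblib.load(model_path)

def run(raw_data):
    try:
        payload = json.loads(raw_data)
        df = pd.DataFrame(payload["data"])
        df = transform_input(df)
        predictions = model.predict(df)
        return json.dumps({"result": predictions.tolist()})
    except Exception as exc:
        return json.dumps({"error": str(exc)})
```

Because the transformation lives in transform_input, the endpoint can accept the raw column values you listed and reshape them to the trained schema before every prediction.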
Upvotes: 0