Reputation: 723
I'm following SageMaker's k_nearest_neighbors_covtype example and had some questions about the way they pass their training data to the model.
For those who haven't seen it: they load data from the internet, run some preprocessing, then save it to an S3 bucket in a binary format (RecordIO-wrapped protobuf). Their code is as follows:
import numpy as np
import boto3
import os
import sagemaker
import io
import sagemaker.amazon.common as smac
# preprocess
raw_data_file = os.path.join(data_dir, "raw", "covtype.data.gz")
raw = np.loadtxt(raw_data_file, delimiter=',')
# split into train/test with a 90/10 split
np.random.seed(0)
np.random.shuffle(raw)
train_size = int(0.9 * raw.shape[0])
train_features = raw[:train_size, :-1]
train_labels = raw[:train_size, -1]
test_features = raw[train_size:, :-1]
test_labels = raw[train_size:, -1]
# write to buffer
buf = io.BytesIO()
smac.write_numpy_to_dense_tensor(buf, train_features, train_labels)
buf.seek(0)
# upload to s3
bucket = sagemaker.Session().default_bucket()
prefix = 'knn-blog-2018-04-17'
key = 'recordio-pb-data'
boto3.resource('s3').Bucket(bucket).Object(os.path.join(prefix, 'train', key)).upload_fileobj(buf)
s3_train_data = 's3://{}/{}/train/{}'.format(bucket, prefix, key)
print('uploaded training data location: {}'.format(s3_train_data))
Later, when calling model.fit(), they pass that S3 path as the training dataset.
From this example, I'm having trouble understanding how the data needs to be structured, and I'm also wondering whether there is a simpler way to load data directly from a pandas dataframe.
My Question:
Let's say after preprocessing I have a pandas dataframe in the following format (~10k records):
type brown green red yellow
NAME
awfulbrown 0.00 33.33 33.33 33.33
candyapple 0.00 0.00 100.00 0.00
grannysmith 2.96 95.19 0.00 0.72
I want to pass this to nearest neighbors and have it map/cluster points based on the type (color) weights, with each point labeled by NAME. For example, the point candyapple would sit at 100 on the red axis and at 0.00 on the green and yellow axes. The intention is then to pass in a new set of color coordinates (e.g. red: 90.09, yellow: 0.33, green: 9.58) and get back the single nearest neighbor to that point, i.e. the closest match among the records we have stored (here, candyapple).
What further preprocessing do I need to perform on this dataframe before passing it to Sagemaker's KNN model?
What is the simplest way to pass the dataframe? Is there a way to pass it directly to the model?
Upvotes: 0
Views: 1151
Reputation: 2729
You can't pass a dataframe directly to the built-in KNN algorithm. It supports two training input formats, CSV or RecordIO protobuf: https://docs.aws.amazon.com/sagemaker/latest/dg/kNN-in-formats.html.
The latter is more efficient, so it's the one we recommend.
In your case, you would simply convert your dataframe to a NumPy array with to_numpy(), and then you can reuse the code in the notebook.
import pandas as pd

index = [1, 2, 3, 4]
a = ['a', 'b', 'c', 'd']
b = [1, 2, 3, 4]
df = pd.DataFrame({'A': a, 'B': b}, index=index)

# to_numpy() drops the index and returns the underlying values
n = df.to_numpy()
print(n)
print(type(n))  # <class 'numpy.ndarray'>
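Applied to the dataframe in your question, a minimal sketch might look like the following. Note that this is illustrative, not the notebook's code: the toy dataframe values are taken from your example, and since the KNN algorithm needs numeric labels, the NAME index is encoded to integer ids (an assumption on my part; keep the mapping so you can translate predictions back to names).

```python
import io

import numpy as np
import pandas as pd

# Toy stand-in for the dataframe in the question
df = pd.DataFrame(
    {"brown":  [0.00,  0.00,   2.96],
     "green":  [33.33, 0.00,  95.19],
     "red":    [33.33, 100.00, 0.00],
     "yellow": [33.33, 0.00,   0.72]},
    index=["awfulbrown", "candyapple", "grannysmith"],
)

# Features: a float32 matrix of the color weights
features = df.to_numpy().astype("float32")

# Labels: encode each NAME as an integer id, keeping the reverse mapping
names = df.index.to_numpy()
labels = np.arange(len(names), dtype="float32")  # 0 -> awfulbrown, 1 -> candyapple, ...

# Then serialize to RecordIO protobuf and upload, exactly as in the
# notebook code from the question:
#   import sagemaker.amazon.common as smac
#   buf = io.BytesIO()
#   smac.write_numpy_to_dense_tensor(buf, features, labels)
#   buf.seek(0)
#   ...upload buf to S3 and pass the S3 path to model.fit()
```

At inference time, the predicted label is the integer id, which you can look up in `names` to recover the NAME of the nearest stored record.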
The notebook you're using is actually showing how to use KNN for classification. This clustering example may be easier to understand: https://data.solita.fi/machine-learning-building-blocks-in-aws-sagemaker/
Upvotes: 1