bardiak

Reputation: 31

SageMaker ClientError: rows 1-5000 have more fields than expected size 3

I created a K-means training job using a CSV file that I stored in S3. After a while, the job fails with the following error:

Training failed with the following error: ClientError: Rows 1-5000 in file /opt/ml/input/data/train/features have more fields than than expected size 3.

What could be the issue with my file?

Here are the parameters I am passing to sagemaker.create_training_job:

        TrainingJobName=job_name,
        HyperParameters={
            'k': '2',
            'feature_dim': '2'
        },
        AlgorithmSpecification={
            'TrainingImage': image,
            'TrainingInputMode': 'File'
        },
        RoleArn='arn:aws:iam::<my_acc_number>:role/MyRole',
        OutputDataConfig={
            "S3OutputPath": output_location
        },
        ResourceConfig={
            'InstanceType': 'ml.m4.xlarge',
            'InstanceCount': 1,
            'VolumeSizeInGB': 20,
        },
        InputDataConfig=[
            {
                'ChannelName': 'train',
                'ContentType': 'text/csv',
                "CompressionType": "None",
                "RecordWrapperType": "None",
                'DataSource': {
                    'S3DataSource': {
                        'S3DataType': 'S3Prefix',
                        'S3Uri': data_location,
                        'S3DataDistributionType': 'FullyReplicated'
                    }
                }
            }
        ],
        StoppingCondition={
            'MaxRuntimeInSeconds': 600
        }

Upvotes: 2

Views: 1439

Answers (2)

Haroon Salimi

Reputation: 1

Make sure your .csv doesn't have column headers, and that the label is the first column. Also make sure your hyperparameter values are accurate, i.e. feature_dim must equal the number of features in your dataset. If you give it the wrong value, training will fail.

Here's a list of SageMaker kNN hyperparameters and their meanings: https://docs.aws.amazon.com/sagemaker/latest/dg/kNN_hyperparameters.html
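To check the points above before uploading, something like the following could help. It is only a minimal sketch: `check_csv_shape` is a hypothetical helper (not part of SageMaker or boto3), and it assumes the expected field count per row is `feature_dim` plus the number of label columns, and that a row with non-numeric values is probably a header.

```python
import csv
import io


def check_csv_shape(text, feature_dim, label_size=1):
    """Return (line_number, problem) tuples for rows that look wrong.

    Hypothetical helper: assumes each row should hold
    feature_dim + label_size numeric fields and no header row.
    """
    expected = feature_dim + label_size
    problems = []
    for line_no, row in enumerate(csv.reader(io.StringIO(text)), start=1):
        if not row:
            continue  # ignore blank lines
        if len(row) != expected:
            problems.append((line_no, f"{len(row)} fields, expected {expected}"))
            continue
        try:
            for value in row:
                float(value)
        except ValueError:
            problems.append((line_no, "non-numeric value (possible header row)"))
    return problems
```

For a local file you would pass its contents, for example `check_csv_shape(open("features.csv").read(), feature_dim=2)`, where `features.csv` is a placeholder name. An empty list means every row matched the assumed shape.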

Upvotes: 0

winklerm

Reputation: 23

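I've seen this issue appear when doing unsupervised learning, such as the above example using clustering. As far as I know, for CSV input the built-in algorithms assume by default that the first column is the label, which would explain why a job with feature_dim=2 expects 3 fields per row. If you have a CSV input with no label column, you can address this by setting label_size=0 in the ContentType parameter of the SageMaker API call, within the InputDataConfig branch.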

Here's an example of what the relevant section of the call might look like:

"InputDataConfig": [
    {
        "ChannelName": "train",
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "some/path/in/s3",
                "S3DataDistributionType": "ShardedByS3Key"
            }
        },
        "CompressionType": "None",
        "RecordWrapperType": "None",
        "ContentType": "text/csv;label_size=0"
    }
]
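If you build the request in Python, the same channel entry can be assembled with a small function. This is a sketch under assumptions: `csv_train_channel` is a hypothetical helper name, the S3 URI is a placeholder, and the distribution type defaults to the `FullyReplicated` value used in the question.

```python
def csv_train_channel(s3_uri, label_size=0, distribution="FullyReplicated"):
    """Build one InputDataConfig entry for an unlabeled (or labeled) CSV channel.

    Hypothetical helper: label_size=0 states that the CSV has no label column.
    """
    return {
        "ChannelName": "train",
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
                "S3DataDistributionType": distribution,
            }
        },
        "CompressionType": "None",
        "RecordWrapperType": "None",
        # The label_size setting is carried inside the content type string.
        "ContentType": f"text/csv;label_size={label_size}",
    }


channel = csv_train_channel("s3://my-bucket/some/path")  # placeholder URI
```

The result would then be passed as `InputDataConfig=[channel]` alongside the other arguments to `create_training_job`.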

Upvotes: 2
