Himani Pandey

Reputation: 21

Can we use data directly from RDS or a DataFrame as the data source for a SageMaker training job, rather than pulling it from S3 or EFS?

I am using the SageMaker platform for model development and deployment. Data is read from RDS tables and then split into train and test DataFrames. When creating the training job in SageMaker, I found that it only accepts S3 and EFS as data sources. That means I have to write the train and test data back to S3, which duplicates data that is already stored in RDS. I would like to pass the DataFrame from RDS directly as a parameter to the training job. Is there any way to pass a DataFrame to the fit method?

    image="581132636225.dkr.ecr.ap-south-1.amazonaws.com/sagemaker-ols-model:latest"
    model_output_folder = "model-output"
    print(image)
    tree = sagemaker.estimator.Estimator(
        image,
        role,
        1,
        "ml.c4.2xlarge",
        output_path="s3://{}/{}".format(sess.default_bucket(), model_output_folder),
        sagemaker_session=sess,
    )

    tree.fit({'train': "s3_path_having_test_data"}, wait=True)

Upvotes: 2

Views: 886

Answers (2)

diyer0001

Reputation: 113

Yes, absolutely. AWS recommends the pattern of dumping tables from your database to S3 via Database Migration Service (https://aws.amazon.com/dms/) for consumption by SageMaker. This is the common pattern for many data science workflows and may be the best choice in some circumstances. However, for other use cases such as inference pipelines or rapid prototyping, you can go against the database directly in your submitted jobs. Of course, you can also access the database directly inside notebooks in SageMaker Studio.

In your code above you are providing a container capable of processing the data in your S3 bucket. Because the container satisfies SageMaker's conventions and requirements, SageMaker knows how to run the code in your container against your S3 data.

Alternatively, you can use a generalized processing container (there may be a way to hijack an estimator to do this too) and make sure the code in the container contains your database access layer (SQLAlchemy, pandas, etc.) and performs the training. As long as there are no security barriers between the job and your database, you won't need to dump tables to S3. A sketch of such a script follows the code below.

image="581132636225.dkr.ecr.ap-south-1.amazonaws.com/sagemaker-ols-model:latest"

tree = sagemaker.estimator.ScriptProcessor( # or SKLearnProcessor if you want to use an off-the-shelf container
    image_uri=image, # skip if you use SKLearnProcessor
    role=role,
    instance_type="ml.c4.2xlarge",
    base_job_name="job_name",
    instance_count=1
)

tree_args = tree.run(
    # this is your code that runs your OLS job and accesses your database directly
    code='mycode.py',
    # inputs only used here to copy any extra code or extra files to your container (or the sagemaker off-the-shelf one)
    inputs=[ProcessingInput(source="./src", destination="/opt/ml/processing/input/code/src"],
)

Upvotes: 0

Gili Nachum

Reputation: 5578

The training data must be read from Amazon S3, Amazon EFS, or Amazon FSx for Lustre.
One advantage of this is being able to reproduce your training results later on, since the input data is frozen in time (unless deleted), as opposed to a live DB.

Typical code:

train_df.to_csv("train.csv", header=False, index=False)
boto3.Session().resource("s3").Bucket(bucket).Object(
    os.path.join(prefix, "train/train.csv")
).upload_file("train.csv")
s3_path_having_test_data = "s3://{}/{}/train".format(bucket, prefix)

tree.fit({'train': "s3_path_having_test_data"}, wait=True)
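If you already have a SageMaker session (here assumed to be the `sess` from the question's code), a sketch of the same upload in one call is possible with `Session.upload_data`, which returns the S3 URI to pass to `fit`:

    # Alternative sketch: let the SageMaker SDK handle the upload
    # (assumes `sess` is a sagemaker.Session(), as in the question).
    s3_train_uri = sess.upload_data(
        path="train.csv",                      # local file written by train_df.to_csv(...)
        bucket=bucket,
        key_prefix="{}/train".format(prefix),  # ends up at s3://<bucket>/<prefix>/train/train.csv
    )

    tree.fit({"train": s3_train_uri}, wait=True)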

Upvotes: 1
