Aaron Arima

Reputation: 184

How to maximize DB upload rate with Azure Cosmos DB

Here is my problem. I am trying to upload a large CSV file (~14 GB) to Cosmos DB, but I am finding it difficult to maximize the throughput I am paying for. On the Azure portal metrics overview UI, it says I am using 73 RU/s when I am paying for 16,600 RU/s. Right now I am using pymongo's bulk write function to upload to the db, but I find that any bulk_write batch longer than 5 operations throws a hard Request rate is large. exception. Am I doing this wrong? Is there a more efficient way to upload data in this scenario? Internet bandwidth is probably not the problem, because I am uploading from an Azure VM to Cosmos DB.

Structure of how I am uploading in Python now:

import csv
import pymongo

operations = []

with open(csv_path) as csv_file:          # csv_path: path to the ~14 GB file
    for row in csv.reader(csv_file):
        row[id_index_1] = convert_id_to_useful_id(row[id_index_1])

        find_criteria = {
            # find query
        }

        upsert_dict = {
            # row data
        }
        operations.append(pymongo.UpdateOne(find_criteria, upsert_dict, upsert=True))

        # flush the batch once it grows past 5 operations
        if len(operations) > 5:
            results = collection.bulk_write(operations)
            operations = []

# write whatever is left over after the loop ends
if operations:
    results = collection.bulk_write(operations)

Any suggestions would be greatly appreciated.

Upvotes: 3

Views: 1543

Answers (4)

Musham Ajay

Reputation: 1

I have used the Cosmos DB Data Migration Tool, which is great for sending data to Cosmos DB without much configuration. I expect it can handle CSV files as large as 14 GB as well.

Below are the numbers from the transfers we ran:

Records transferred | Throughput | Parallel requests | Time
10,000              | 4,000      | 500               | 25 seconds
10,000              | 4,000      | 100               | 90 seconds
10,000              | 350        | 10                | 300 seconds

Upvotes: 0

Aaron Arima

Reputation: 184

I was able to improve the upload speed. I noticed that each physical partition has its own throughput limit (and, oddly, the number of physical partitions times the per-partition throughput is still not the total throughput for the collection). So I split the data by partition key and created a separate upload process for each partition key value, which increased my upload speed by a factor of the number of physical partitions.
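For reference, here is a minimal sketch of that approach. Everything named below is an assumption rather than my actual pipeline: the partition key is assumed to sit in CSV column PARTITION_COL and the document id in column ID_COL, the connection string, database/collection names and the batch size of 100 are placeholders, and the {"$set": ...} update shape is only illustrative.

# Sketch: one upload process per partition key value (all names are placeholders).
import csv
from collections import defaultdict
from multiprocessing import Process

import pymongo

CONN_STR = "mongodb://<cosmos-account>.documents.azure.com:10255/?ssl=true"  # placeholder
CSV_PATH = "data.csv"       # placeholder
PARTITION_COL = 0           # hypothetical: CSV column holding the partition key
ID_COL = 1                  # hypothetical: CSV column holding the document id

def upload_rows(rows):
    """Upsert the rows for one partition key value from its own process."""
    client = pymongo.MongoClient(CONN_STR)
    collection = client["mydb"]["mycollection"]           # placeholder names
    operations = []
    for row in rows:
        find_criteria = {"_id": row[ID_COL]}
        upsert_dict = {"$set": {"data": row}}              # illustrative update shape
        operations.append(pymongo.UpdateOne(find_criteria, upsert_dict, upsert=True))
        if len(operations) >= 100:                         # illustrative batch size
            collection.bulk_write(operations, ordered=False)
            operations = []
    if operations:                                         # flush the remainder
        collection.bulk_write(operations, ordered=False)

if __name__ == "__main__":
    # Group rows by partition key value, then start one process per group.
    groups = defaultdict(list)
    with open(CSV_PATH, newline="") as f:
        for row in csv.reader(f):
            groups[row[PARTITION_COL]].append(row)
    procs = [Process(target=upload_rows, args=(rows,)) for rows in groups.values()]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

In practice, grouping a ~14 GB file in memory like this is not feasible; you would pre-split the file by partition key value first. The ordered=False flag just lets the server keep processing a batch instead of stopping at the first failed operation.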

Upvotes: 0

Jay Gong

Reputation: 23782

Aaron, yes, as you said in the comment, the migration tool is not supported for the Azure Cosmos DB MongoDB API. You can find the statement below in the official doc.

The Data Migration tool does not currently support Azure Cosmos DB MongoDB API either as a source or as a target. If you want to migrate the data in or out of MongoDB API collections in Azure Cosmos DB, refer to Azure Cosmos DB: How to migrate data for the MongoDB API for instructions. You can still use the Data Migration tool to export data from MongoDB to Azure Cosmos DB SQL API collections for use with the SQL API.

As a workaround, you could use Azure Data Factory. Please refer to this doc to configure Cosmos DB as the sink, and to this doc to configure the CSV file in Azure Blob Storage as the source. In the pipeline, you can configure the batch size.


Of course, you could also do this programmatically. You didn't miss anything: the error Request rate is large just means you have exceeded the provisioned RU quota. You could raise the RU setting for the collection. Please refer to this doc.
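If you do stay with pymongo, one workaround is to catch the throttling error and retry the batch with a backoff. This is only a rough sketch, assuming the MongoDB API reports Request rate is large as error code 16500 (verify this for your account) and that the operations are upserts, so re-sending the whole batch is harmless; the function name and backoff values are placeholders.

import time

from pymongo.errors import BulkWriteError, OperationFailure

def bulk_write_with_retry(collection, operations, max_retries=10):
    """Retry a throttled bulk_write with exponential backoff (sketch)."""
    delay = 1.0
    for attempt in range(max_retries):
        try:
            return collection.bulk_write(operations, ordered=False)
        except (BulkWriteError, OperationFailure) as exc:
            details = getattr(exc, "details", None) or {}
            write_errors = details.get("writeErrors", [])
            # 16500 is the code Cosmos DB's MongoDB API uses for "Request rate is large"
            throttled = getattr(exc, "code", None) == 16500 or any(
                e.get("code") == 16500 for e in write_errors
            )
            if not throttled:
                raise
            time.sleep(delay)               # back off before retrying the batch
            delay = min(delay * 2, 30)
    raise RuntimeError("bulk_write was still throttled after all retries")

You would call this in place of collection.bulk_write(operations) in your loop; a cleaner variant would retry only the operations listed in writeErrors rather than the whole batch.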

If you have any other concerns, please feel free to let me know.

Upvotes: 1

Rob Reagan

Reputation: 7686

I'd take a look at the Cosmos DB: Data Migration Tool. I haven't used it with the MongoDB API, but it is supported. I have used it to move lots of documents from my local machine to Azure with great success, and it will utilize the RU/s that are available.

If you need to do this programmatically, I suggest taking a look at the underlying source code for the Data Migration Tool, which is open source. You can find the code here.

Upvotes: 0
