Thomas Segato

Reputation: 5213

Best storage in Azure for very large data sets with fast update options

We have around 300 million documents on a drive and need to delete around 200 million of them. I am going to write the 200 million paths to a storage service so I can keep track of deleted documents. My current thought is that an Azure SQL database is probably not well suited to this volume. Cosmos DB is too expensive. Storing CSV files is bad, because I need to do an update every time I delete a file. Table storage seems like a pretty good match, but it does not offer group-by operations, which would come in handy for status reports. I don't know much about Data Lake, i.e. whether it supports fast updates or is more of an archive. All input is welcome for choosing the right storage for this kind of reporting.
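Roughly what I have in mind for the tracking table, sketched with the azure-data-tables Python SDK (the table name, key scheme and Status column are just my placeholders):

    # Sketch: track deleted document paths in Azure Table Storage.
    # Table name, key scheme and Status values are placeholders.
    import hashlib
    from azure.data.tables import TableClient

    client = TableClient.from_connection_string(
        conn_str="<connection-string>", table_name="deletedDocs"
    )

    def make_entity(path: str) -> dict:
        # Hash the path: RowKey must be unique and may not contain '/',
        # so the raw path cannot be used directly as a key.
        digest = hashlib.sha256(path.encode("utf-8")).hexdigest()
        return {
            "PartitionKey": digest[:2],  # spread rows over 256 partitions
            "RowKey": digest,
            "Path": path,
            "Status": "pending",
        }

    def mark_deleted(path: str) -> None:
        entity = make_entity(path)
        entity["Status"] = "deleted"
        client.upsert_entity(entity)  # point update once the file is gone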

Thanks in advance.

Upvotes: 0

Views: 955

Answers (1)

Jim Xu

Reputation: 23111

For your needs, you can use Azure Cosmos DB or Azure Table Storage.

Azure Table Storage offers a NoSQL key-value store for semi-structured data. Unlike a traditional relational database, each entity (analogous to a row in relational terminology) can have its own structure, allowing your application to evolve without downtime to migrate between schemas.
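To make the schemaless point concrete, here is a short sketch with the azure-data-tables Python SDK (the table and property names are hypothetical): two entities in the same table can carry different properties.

    # Two entities in the same table with different properties --
    # Table Storage imposes no shared schema beyond PartitionKey/RowKey.
    from azure.data.tables import TableClient

    client = TableClient.from_connection_string(
        conn_str="<connection-string>", table_name="deletedDocs"
    )

    client.create_entity({
        "PartitionKey": "batch-01", "RowKey": "doc-0001",
        "Path": "/archive/2019/a.pdf", "Status": "deleted",
    })
    client.create_entity({
        "PartitionKey": "batch-01", "RowKey": "doc-0002",
        "Path": "/archive/2019/b.pdf", "Status": "pending",
        "RetryCount": 3,  # extra property only this entity has
    })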

Azure Cosmos DB is a multi-model database service designed for global use in mission-critical systems. Besides its SQL API, it exposes a Table API and APIs compatible with Apache Cassandra, MongoDB, and Gremlin. These allow you to swap an existing database out for a Cosmos DB implementation with little code change.
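For example, the Table API means code written against the azure-data-tables SDK can target either service; only the connection string changes. A sketch (account names and keys are placeholders, and the Cosmos endpoint follows the usual Table API form):

    # Same client code, two backends; only the connection string differs.
    from azure.data.tables import TableClient

    storage_client = TableClient.from_connection_string(
        conn_str="DefaultEndpointsProtocol=https;AccountName=mystorage;"
                 "AccountKey=<key>;EndpointSuffix=core.windows.net",
        table_name="deletedDocs",
    )
    cosmos_client = TableClient.from_connection_string(
        conn_str="DefaultEndpointsProtocol=https;AccountName=mycosmos;"
                 "AccountKey=<key>;TableEndpoint=https://mycosmos."
                 "table.cosmos.azure.com:443/;",
        table_name="deletedDocs",
    )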

Their differences are as below:

Performance

Azure Table Storage has no upper bound on latency. Cosmos DB guarantees single-digit-millisecond latency for reads and writes, with operations under 15 milliseconds at the 99th percentile worldwide. Throughput on Table Storage is limited to 20,000 operations per second, while Cosmos DB has no upper limit on throughput and supports more than 10 million operations per second.
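If you stay on Table Storage, batched transactions are the usual way to push update throughput toward that limit. A minimal sketch with the azure-data-tables Python SDK (the table name and entity shape are assumptions carried over from the question):

    # Batch point updates: one transaction holds up to 100 operations,
    # and all of them must share a PartitionKey.
    from collections import defaultdict
    from itertools import islice
    from azure.data.tables import TableClient

    client = TableClient.from_connection_string(
        conn_str="<connection-string>", table_name="deletedDocs"
    )

    def upsert_in_batches(entities, batch_size=100):
        # group by PartitionKey first, as transactions require
        by_partition = defaultdict(list)
        for e in entities:
            by_partition[e["PartitionKey"]].append(e)
        for group in by_partition.values():
            it = iter(group)
            while batch := list(islice(it, batch_size)):
                client.submit_transaction([("upsert", e) for e in batch])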

Global Distribution

Azure Table Storage supports a single region with an optional read-only secondary region for availability. Cosmos DB supports distribution from 1 to more than 30 regions with automatic failovers worldwide.

Billing

Azure Table Storage uses storage volume to determine billing. Pricing is tiered, getting progressively cheaper per GB the more storage you use. Operations incur a charge measured per 10,000 transactions.

Cosmos DB has two billing models: provisioned throughput and consumed storage.

  • Provisioned Throughput: Provisioned throughput (also called reserved throughput) guarantees high performance at any scale. You specify the throughput (RU/s) that you need, and Azure Cosmos DB dedicates the resources required to guarantee it. You are billed hourly for the maximum provisioned throughput in a given hour (a rough worked example follows this list).
  • Consumed Storage: You are billed a flat rate for the total amount of storage (GBs) consumed for data and the indexes for a given hour.
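As a rough worked example of how the two meters combine (the prices below are placeholder assumptions, not current Azure rates; check the pricing page for real numbers):

    # Hypothetical Cosmos DB cost estimate; PRICE_* values are
    # placeholder assumptions, not current Azure rates.
    PRICE_PER_100_RU_PER_HOUR = 0.008  # assumed $/hour per 100 RU/s
    PRICE_PER_GB_PER_MONTH = 0.25      # assumed $/GB-month

    provisioned_ru = 10_000            # RU/s held for the whole month
    storage_gb = 50                    # data plus indexes
    hours = 730                        # roughly one month

    throughput_cost = (provisioned_ru / 100) * PRICE_PER_100_RU_PER_HOUR * hours
    storage_cost = storage_gb * PRICE_PER_GB_PER_MONTH
    print(f"throughput ${throughput_cost:,.2f} + storage ${storage_cost:,.2f}")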

For more details, please refer to the documentation.

Upvotes: 1
