Tutan Ramen

Reputation: 1242

Update 100 million rows in a 2-3 billion row dataset in the cloud on a daily basis

It's a tale as old as time. The business wants to take billions of rows (2-3 billion) and stream them from Oracle to the cloud (in our case AWS). So far so good. Then they want to process them in the cloud; still OK here.

Then they want to update a subset of the rows ~5% (call it 125 million) on a daily basis and process the data again.

I'm not saying it's unreasonable, I'm just not sure of the most efficient way to tackle this. For some background, I'm a senior full-stack developer with plenty of experience working with AWS and giant datasets, but I'm by no means a big data expert. Plus, there are so many options in the cloud these days for big data that it's hard to know where to start.

So, the question:

Is there a product (preferably in AWS) that's built for this use case?

Since this runs on a daily basis, we need to be able to update all the rows efficiently (quickly) and finish our processing before the next day comes and the whole cycle starts over.

The key here is the efficiency of the updates. Things like Glue fall apart because the data would have too many partitions (i.e. ~50-100 million in a single table).

Upvotes: 1

Views: 1845

Answers (3)

gusto2

Reputation: 12075

There are multiple options, and the nice thing about the cloud is that you can try things and see what suits you best without spending a lot of time or money on an infeasible tool.

The only comment here is that when manipulating ~100M rows, it may be worth having a look at ETL (AWS Glue) or at creating temp tables with the updated values (Oracle has INSERT INTO ... SELECT, AWS Redshift has CREATE TABLE AS SELECT); a rough sketch of the latter follows.
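
To make the temp-table idea concrete, here is a rough sketch (not a drop-in solution), assuming the day's changed rows have already been landed in an "updates" table on Redshift and that the data is keyed by an "id" column; every table, column and connection detail below is a placeholder:

  # Rebuild the table from base + updates with CTAS, then swap it in.
  # "base", "updates", "id" and "value" are hypothetical names.
  import psycopg2

  REBUILD_SQL = """
  -- Build a fresh copy of the table, preferring today's updated values.
  CREATE TABLE base_new AS
  SELECT COALESCE(u.id, b.id)       AS id,
         COALESCE(u.value, b.value) AS value
  FROM base b
  FULL OUTER JOIN updates u ON u.id = b.id;

  -- Swap the rebuilt table into place.
  ALTER TABLE base RENAME TO base_old;
  ALTER TABLE base_new RENAME TO base;
  DROP TABLE base_old;
  """

  conn = psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
                          port=5439, dbname="analytics",
                          user="etl_user", password="...")
  with conn, conn.cursor() as cur:  # runs as a single transaction
      cur.execute(REBUILD_SQL)
  conn.close()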

There are some common options and tools to load/manipulate a large amount of data:

  • AWS Glue - ETL and data catalog. This is a great tool to prepare data before/while loading into another data repository
  • Maybe RDS will be good enough, depending on what data you have and what "process the data" means. In my experience, a well-scaled DB is good enough for basic analytics and data manipulation in 90% of cases.
  • AWS Redshift is an analytics engine that can process petabyte-scale data, but it is not really suited for data manipulation
  • AWS EMR - managed Spark/Presto/HBase cluster, can be very powerful assuming you can use it properly
  • AWS Athena - a managed, serverless Presto-based query engine with a simple-to-use SQL interface

Note:

  • if you compress data when moving it to AWS, consider using BZIP2: the format is "splittable", so some tools (Athena, Glue, ...) can load large files in parallel
  • when processing large datasets from S3, use a private S3 VPC endpoint to avoid excessive egress costs from going through the public endpoint

"Things like Glue fall apart because the data would have too many partitions"

Interesting; maybe you could engage AWS directly with this use case.

Upvotes: 0

Bill Weiner

Reputation: 11032

This came across my feed with @Parsifal recommending Redshift. I'll second the use of Redshift for this job and add a few other possibilities.

You don't mention how this data set will be used. I'm guessing you want it in S3 with some reasonable partitioning, but this isn't clear. Do you want to be able to run SQL analytics on this data? So there is some uncertainty about the optimal solution.

First off, Redshift can get expensive, especially when it is just sitting idle. So if you just need to process data nightly and nothing else, you will want to shut Redshift down when it isn't needed. If you can use the analytic horsepower of Redshift 24/7, then you will get a lot of performance for your dollar.

Next, Redshift eats billion-row data sets for breakfast. It has parallel connections to S3, so it is fast reading from there. It is, however, a columnar data warehouse implemented as a cluster, and while it is fast on most (like almost all) operations, there are some algorithms that don't work well on this architecture. (There's a reason map-reduce came along.) So make sure that the data processing you need to do maps onto a traditional database. Reach out for help if you are not sure.

Redshift Spectrum is a set of functionality that allows Redshift to operate on S3 files as "external tables". This can greatly simplify the use of S3 as the data storage, but it does come with some limitations: you cannot modify these S3 files from inside of Redshift, though you can write new files with changed content to S3. There can be some data-processing penalty for using S3 as the data store (and in some cases a speed-up). If these are a problem, you may want to use native Redshift storage for your tables; it doesn't add a huge amount of complexity to your processing.
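
For illustration, exposing the S3 files to Redshift as an external table might look roughly like this; the Glue catalog database, IAM role, columns and S3 location are all hypothetical:

  # Define a Spectrum external schema and table over files in S3.
  # Every name and ARN here is a placeholder.
  import psycopg2

  CREATE_SCHEMA = """
  CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
  FROM DATA CATALOG DATABASE 'spectrumdb'
  IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-spectrum-role'
  CREATE EXTERNAL DATABASE IF NOT EXISTS;
  """

  CREATE_TABLE = """
  CREATE EXTERNAL TABLE spectrum.events (
      id         BIGINT,
      payload    VARCHAR(1000),
      updated_at TIMESTAMP
  )
  STORED AS PARQUET
  LOCATION 's3://yourbucket/complete/';
  """

  conn = psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
                          port=5439, dbname="analytics",
                          user="etl_user", password="...")
  conn.autocommit = True  # CREATE EXTERNAL TABLE cannot run inside a transaction
  with conn.cursor() as cur:
      cur.execute(CREATE_SCHEMA)
      cur.execute(CREATE_TABLE)
  conn.close()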

Redshift databases can be "snapshotted" and then shut down. The next day you can create a new Redshift database from this snapshot and you are off and running. The database will usually be available for use in 10 to 20 min so you may want to start it up a little before you need it in your daily processing.
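
A rough boto3 sketch of that daily cycle; the cluster and snapshot identifiers are placeholders, and restore times vary, so kick the restore off ahead of the processing window:

  # Morning: restore the cluster from yesterday's final snapshot and wait for it.
  # Evening: snapshot the cluster and delete it to stop paying for idle time.
  import boto3

  redshift = boto3.client("redshift", region_name="us-east-1")

  redshift.restore_from_cluster_snapshot(
      ClusterIdentifier="nightly-processing",
      SnapshotIdentifier="nightly-processing-2024-01-01",  # yesterday's snapshot
  )
  redshift.get_waiter("cluster_available").wait(
      ClusterIdentifier="nightly-processing")

  # ... run the nightly load / processing here ...

  redshift.delete_cluster(
      ClusterIdentifier="nightly-processing",
      SkipFinalClusterSnapshot=False,
      FinalClusterSnapshotIdentifier="nightly-processing-2024-01-02",  # today's
  )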

I'd look at Redshift for your case, but a lot about how you will want to use it will depend on how you are using your data beyond this processing step.

Upvotes: 1

Parsifal

Reputation: 4506

I would look at Redshift, primarily because SQL is more familiar than Spark for most people. Assuming that you are uploading the data into S3, the upsert process looks like this (all within a single transaction; a sketch follows the list):

  • Use a COPY command to insert data from a staging area on S3 into a temporary table.
  • If you're actually doing an update (versus an append), delete the matching rows from the base table.
  • Insert the rows from the temporary table into the base table.
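
A minimal sketch of that flow, assuming gzipped CSV files in an S3 staging prefix, a base table keyed by an "id" column, and an IAM role attached to the cluster; every name below is a placeholder:

  # Staged upsert: COPY into a temp table, delete the matching rows, re-insert.
  import psycopg2

  UPSERT_SQL = """
  CREATE TEMP TABLE stage (LIKE base);

  -- 1. Bulk-load the day's changes from the S3 staging area.
  COPY stage
  FROM 's3://yourbucket/staging/'
  IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
  CSV GZIP;

  -- 2. Remove the rows that are about to be replaced.
  DELETE FROM base
  USING stage
  WHERE base.id = stage.id;

  -- 3. Insert the updated rows.
  INSERT INTO base
  SELECT * FROM stage;
  """

  conn = psycopg2.connect(host="my-cluster.example.redshift.amazonaws.com",
                          port=5439, dbname="analytics",
                          user="etl_user", password="...")
  with conn, conn.cursor() as cur:  # all three steps commit as one transaction
      cur.execute(UPSERT_SQL)
  conn.close()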

I would implement this as an AWS Batch job, but if you're using Airflow or a similar coordination framework, that would work too. For an update of this size, Lambda would probably time out. Make sure that Redshift is configured to auto-vacuum.

If they're set on using Glue, I would take an approach that uses "editions" of the data, so that you're always storing the most up-to-date data. Assuming a nightly job, it would look like this (folder names on S3 are made up; a PySpark sketch follows the list):

  1. Read the base dataset from s3://yourbucket/complete/YYYY/MM/D0
  2. Read the day's updates from s3://yourbucket/updates/YYYY/MM/D1
  3. Combine the two datasets, picking the most recent rows for each key value.
  4. Write out the new dataset to s3://yourbucket/complete/YYYY/MM/D1
  5. Do your processing based on the new dataset.
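
A rough PySpark sketch of steps 1-4, assuming Parquet data with a primary key column "id" and a timestamp column "updated_at" for picking the newest row; both column names are assumptions, and the paths are the made-up ones from the steps above:

  # Build the new "edition": union yesterday's edition with today's updates
  # and keep only the most recent row per key.
  from pyspark.sql import SparkSession, Window, functions as F

  spark = SparkSession.builder.appName("nightly-edition").getOrCreate()

  base = spark.read.parquet("s3://yourbucket/complete/YYYY/MM/D0/")
  updates = spark.read.parquet("s3://yourbucket/updates/YYYY/MM/D1/")

  latest = (
      base.unionByName(updates)
      .withColumn("rn", F.row_number().over(
          Window.partitionBy("id").orderBy(F.col("updated_at").desc())))
      .filter(F.col("rn") == 1)
      .drop("rn")
  )

  # The new edition becomes the input for the downstream processing.
  latest.write.mode("overwrite").parquet("s3://yourbucket/complete/YYYY/MM/D1/")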

Use a life-cycle rule on S3 to prune old datasets.

I think that you'll find it too slow to just start with a base dataset and apply a complete history of updates. Especially since you're changing 5-10% of the rows each day.

And you definitely don't want to partition by the primary key, because Glue is not very performant with large numbers of small files (this is discussed in the AWS docs, but I can't find a link atm). And if you're updating 5-10% of your data with each run, you're not going to save much by only updating those files that have changed.

To partition the dataset to improve parallelism, you could hash the primary key and then use some subset of the hash to create partitions (e.g., hexify the hash and use the first few hex characters as a partition key, as sketched below).
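
Deriving such a partition key in the same PySpark job might look like this; the "id" column and the two-character md5 prefix (up to 256 partitions) are arbitrary choices for illustration:

  # Partition by the first two hex characters of an md5 hash of the key.
  from pyspark.sql import SparkSession, functions as F

  spark = SparkSession.builder.appName("hash-partition").getOrCreate()
  df = spark.read.parquet("s3://yourbucket/complete/YYYY/MM/D1/")

  df = df.withColumn("part",
                     F.substring(F.md5(F.col("id").cast("string")), 1, 2))

  df.write.mode("overwrite").partitionBy("part").parquet(
      "s3://yourbucket/complete-partitioned/YYYY/MM/D1/")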

Upvotes: 1
