Atharv Thakur

Reputation: 701

Big data analysis on Amazon Aurora RDS

I have an Aurora table with 500 million records. I need to perform big data analysis on it, such as finding the diff between two tables. Until now I have been doing this with Hive on files in a file system, but we have now inserted all the file rows into an Aurora DB. I still need to run the same diff every month.

So what would be the best option for this?

  1. Exporting the Aurora data back to S3 as files and then running a Hive query on them (how long might it take to export all the Aurora rows to S3)?
  2. Can I run a Hive query on an Aurora table? (I guess Hive does not support Aurora.)
  3. Running Spark SQL on Aurora (how would it perform)?

Or is there a better way to do this?

Upvotes: 0

Views: 1364

Answers (1)

jbgorski

Reputation: 1939

In my opinion, Aurora MySQL isn't a good option for big data analysis. This follows from the limitations of MySQL InnoDB and from the additional restrictions Aurora places on top of MySQL InnoDB. For instance, you won't find features such as data compression or a columnar format there.

Within Aurora, you could use Aurora Parallel Query, for instance, but it doesn't support partitioned tables.

https://aws.amazon.com/blogs/aws/new-parallel-query-for-amazon-aurora/

Another option is to connect directly to Aurora with AWS Glue and perform the analysis there, but in that case database performance can become a bottleneck.

https://docs.aws.amazon.com/glue/latest/dg/populate-add-connection.html
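If you do read straight from Aurora over JDBC (Glue ETL jobs run on Spark, and the question also mentions Spark SQL), one way to soften the bottleneck is to split the read into key ranges instead of pulling the whole table through a single connection. The following is only a sketch of the standard Spark JDBC options for that. The host, database, table, key column, and bounds are hypothetical placeholders, and it assumes the table has a numeric key that is reasonably evenly distributed.

```python
def jdbc_read_options(host, database, table, user, password,
                      key_column, min_key, max_key, num_partitions=32):
    """Build Spark JDBC options for a range-partitioned read.

    All argument values are supplied by the caller; nothing here is
    specific to the asker's schema.
    """
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if min_key >= max_key:
        raise ValueError("min_key must be smaller than max_key")
    return {
        "url": f"jdbc:mysql://{host}:3306/{database}",
        "dbtable": table,
        "user": user,
        "password": password,
        # Spark splits [lowerBound, upperBound] on this column into
        # numPartitions ranges and issues one query per range.
        "partitionColumn": key_column,
        "lowerBound": str(min_key),
        "upperBound": str(max_key),
        "numPartitions": str(num_partitions),
    }


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    opts = jdbc_read_options(
        host="reader.example.internal", database="sales", table="orders",
        user="analyst", password="secret",
        key_column="id", min_key=1, max_key=500_000_000,
    )
    print({k: v for k, v in opts.items() if k != "password"})
```

In a Spark job this dictionary would be passed as `spark.read.format("jdbc").options(**opts).load()`. Pointing `host` at an Aurora reader endpoint rather than the writer keeps the scan away from production writes, but each partition still opens its own connection, so the database can remain the limiting factor.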

I suggest exporting the data to S3 with SELECT INTO OUTFILE S3 (LOAD DATA FROM S3 covers the import direction) and then performing the analysis with Glue or EMR. You should also consider using Redshift instead of Aurora.
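As a rough illustration of the export step, the helper below builds an Aurora MySQL `SELECT ... INTO OUTFILE S3` statement for one table. It is a minimal sketch: the function name, table name, and bucket path are hypothetical, and it assumes the cluster already has an IAM role that lets it write to the bucket.

```python
def build_export_sql(table, s3_uri, delimiter=","):
    """Return a SELECT ... INTO OUTFILE S3 statement for Aurora MySQL."""
    # Identifiers cannot be bound as query parameters, so validate the
    # table name before interpolating it into the statement.
    if not table.replace("_", "").isalnum():
        raise ValueError(f"unexpected table name: {table!r}")
    if not s3_uri.startswith("s3"):
        raise ValueError(f"expected an S3 URI, got: {s3_uri!r}")
    return (
        f"SELECT * FROM `{table}` "
        f"INTO OUTFILE S3 '{s3_uri}' "
        f"FIELDS TERMINATED BY '{delimiter}' "
        "LINES TERMINATED BY '\\n' "
        # MANIFEST ON writes a manifest file listing the exported parts;
        # OVERWRITE ON replaces files left over from a previous run.
        "MANIFEST ON OVERWRITE ON"
    )


if __name__ == "__main__":
    # Hypothetical table and bucket, for illustration only.
    print(build_export_sql("orders", "s3://my-bucket/exports/orders"))
```

Run the generated statement once per table through any MySQL client connected to the cluster. The resulting delimited files can then be queried from EMR or Glue, much like the file-based Hive workflow described in the question.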

Upvotes: 1
