Reputation: 1269
I managed to load and merge the 6 heavy Excel files I had from my RStudio instance (on an EC2 server) into one single table in PostgreSQL (linked with RDS).
Now this table has 14 columns and 2.4 million rows.
The size of the table in PostgreSQL is 1059MB.
The EC2 instance is a t2.medium.
I wanted to analyze it, so I thought I could simply load the table with the DBI
package and perform different operations on it.
So I did:
my_big_df <- dbReadTable(con, "my_big_table")
my_big_df <- unique(my_big_df)
and my RStudio froze, out of memory...
My questions would be:
1) Is what I have been doing (to handle big tables like this) an OK/good practice?
2) If yes to 1), is increasing the EC2 server memory the only way to be able to perform
the unique() operation or other similar operations?
3) If yes to 2), how can I know by how much I should increase the EC2 server memory?
Thanks!
Upvotes: 0
Views: 146
Reputation: 852
dbReadTable converts the entire table into a data.frame, which is not what you want to do for such a big table.
As @cory told you, you need to extract the required info using SQL queries.
You can do that with DBI using combinations of dbSendQuery, dbBind and dbFetch, or with dbGetQuery.
For example, you could define a function to get the required data:
filterBySQLString <- function(databaseDB, sqlString){
  sqlString <- as.character(sqlString)
  # send the query, fetch the result, then release the result set
  dbResponse <- dbSendQuery(databaseDB, sqlString)
  requestedData <- dbFetch(dbResponse)
  dbClearResult(dbResponse)
  return(requestedData)
}
# write your query to get unique values
SQLquery <- "SELECT DISTINCT ... FROM ..."
my_big_df <- filterBySQLString(myDB, SQLquery)
my_big_df <- unique(my_big_df)
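Alternatively, if you only need the result of a single query and don't plan to reuse the prepared statement, dbGetQuery wraps the send/fetch/clear steps in one call. A minimal sketch, assuming the connection con and the table my_big_table from the question:
library(DBI)
# SELECT DISTINCT pushes the de-duplication to PostgreSQL,
# so only the unique rows are transferred into R
my_big_df <- dbGetQuery(con, "SELECT DISTINCT * FROM my_big_table")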
If you cannot use SQL, then you have two options:
1) stop using RStudio and try to run your code from the terminal or via Rscript (a sketch follows after this list).
2) beef up your instance.
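For option 1, a minimal sketch of a non-interactive script; the file name analysis.R is hypothetical, the connection details are placeholders, and the RPostgres driver is assumed:
## analysis.R -- run from the terminal with:  Rscript analysis.R
library(DBI)

# placeholder connection details; adjust to your RDS instance
con <- dbConnect(RPostgres::Postgres(),
                 host = "your-rds-endpoint", dbname = "your_db",
                 user = "your_user", password = "your_password")

# still push the heavy work to PostgreSQL rather than to R
my_big_df <- dbGetQuery(con, "SELECT DISTINCT * FROM my_big_table")

dbDisconnect(con)
Running outside RStudio avoids the IDE's own memory overhead, but the main saving still comes from filtering in the database rather than in R.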
Upvotes: 1