Reputation: 1269
I managed to load and merge the 6 heavy Excel files I had from my RStudio instance (on an EC2 server) into one single table in PostgreSQL (linked with RDS).
Now this table has 14 columns and 2.4 million rows.
The size of the table in PostgreSQL is 1059MB.
The EC2 instance is a t2.medium.
I wanted to analyze it, so I thought I could simply load the table with the DBI
package and perform different operations on it.
So I did:
my_big_df <- dbReadTable(con, "my_big_table")
my_big_df <- unique(my_big_df)
and my RStudio froze, out of memory...
My questions would be:
1) Is what I have been doing (to handle big tables like this) an OK/good practice?
2) If yes to 1), is increasing the EC2 server memory the only way to be able to perform
the unique() operation or other similar operations?
3) If yes to 2), how can I know by how much I should increase the EC2 server memory?
Thanks!
Upvotes: 0
Views: 146
Reputation: 852
dbReadTable converts the entire table into a data.frame, which is not what you want to do for such a big table.
As @cory told you, you need to extract the required info using SQL queries.
You can do that with DBI using combinations of dbSendQuery, dbBind and dbFetch, or with dbGetQuery.
For example, you could define a function to get the required data:
filterBySQLString <- function(databaseDB, sqlString){
  sqlString <- as.character(sqlString)
  # send the query, fetch the result, then release the result set
  dbResponse <- dbSendQuery(databaseDB, sqlString)
  requestedData <- dbFetch(dbResponse)
  dbClearResult(dbResponse)
  return(requestedData)
}
# write your query to get unique values
SQLquery <- "SELECT DISTINCT ... FROM ..."
my_big_df <- filterBySQLString(myDB, SQLquery)
my_big_df <- unique(my_big_df)
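Alternatively, if you only need the result of a single query and don't plan to reuse the prepared statement, dbGetQuery wraps the send/fetch/clear steps in one call. A minimal sketch, assuming the connection con and the table my_big_table from the question:
library(DBI)
# SELECT DISTINCT pushes the de-duplication to PostgreSQL,
# so only the unique rows are transferred into R
my_big_df <- dbGetQuery(con, "SELECT DISTINCT * FROM my_big_table")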
If you cannot use SQL, then you have two options:
1) stop using RStudio and try to run your code from the terminal or via Rscript (a sketch follows after this list).
2) beef up your instance.
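For option 1, a minimal sketch of a non-interactive script; the file name analysis.R is hypothetical, the connection details are placeholders, and the RPostgres driver is assumed:
## analysis.R -- run from the terminal with:  Rscript analysis.R
library(DBI)

# placeholder connection details; adjust to your RDS instance
con <- dbConnect(RPostgres::Postgres(),
                 host = "your-rds-endpoint", dbname = "your_db",
                 user = "your_user", password = "your_password")

# still push the heavy work to PostgreSQL rather than to R
my_big_df <- dbGetQuery(con, "SELECT DISTINCT * FROM my_big_table")

dbDisconnect(con)
Running outside RStudio avoids the IDE's own memory overhead, but the main saving still comes from filtering in the database rather than in R.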
Upvotes: 1