Lucy

Reputation: 51

How to retrieve more than 100000 rows from Redshift using R and dplyr

I'm analyzing data from a Redshift database, working in R over a dplyr connection, which works:

my_db <- src_postgres(host = 'my-cluster-blahblah.redshift.amazonaws.com', port = '5439', dbname = 'dev', user = 'me', password = 'mypw')
mytable <- tbl(my_db, "mytable")

viewstation <- mytable %>%
    filter(stationname == "something")

When I try to turn that output into a data frame:

thisdata <- data.frame(viewstation)

I get the following warning:

Only first 100,000 results retrieved. Use n = -1 to retrieve all. 

Where am I supposed to set n?

Upvotes: 4

Views: 1613

Answers (2)

qfazille

Reputation: 1671

For those who are still using dplyr 0.5 (like me).

The parameter n is an argument of the collect() function.

my_db <- src_postgres(host = 'my-cluster-blahblah.redshift.amazonaws.com', port = '5439', dbname = 'dev', user = 'me', password = 'mypw')
mytable <- tbl(my_db, "mytable") %>% collect(n = Inf)

This will get you more than 100,000 rows.
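Note that this collects the entire table into R. If you only need a subset, it may be better to filter before collecting, so that Redshift does the filtering and only the matching rows are transferred; a minimal sketch based on the query from the question:

viewstation <- tbl(my_db, "mytable") %>%
    filter(stationname == "something") %>%   # filter is translated to SQL and runs on Redshift
    collect(n = Inf)                         # then fetch all matching rows (dplyr 0.5)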

Upvotes: 0

phiver

Reputation: 23598

Instead of using

thisdata <- data.frame(viewstation)

use

thisdata <- collect(viewstation)

collect() will pull all the data from the database back into R. As mentioned in the dplyr databases vignette:

When working with databases, dplyr tries to be as lazy as possible. It’s lazy in two ways:

It never pulls data back to R unless you explicitly ask for it.

It delays doing any work until the last possible minute, collecting together everything you want to do then sending that to the database in one step.
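In this case, that means the filter() is translated to SQL and runs on Redshift, and no rows are transferred until collect() is called. Putting the pieces from the question together:

# Lazy: builds the SQL query, no rows leave Redshift yet
viewstation <- mytable %>%
    filter(stationname == "something")

# Eager: executes the query and pulls every matching row into an R data frame
thisdata <- collect(viewstation)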

Upvotes: 7
