user75252

Reputation: 309

dplyr::tbl returning list instead of table

I am trying to use dplyr in RStudio to manipulate tables in an MS SQL Server database. I successfully connected to the database using DBI and odbc.

Code:

library(DBI)
library(odbc)
library(dplyr)
library(dbplyr)
con <- dbConnect(odbc(), 
                 Driver = "SQL Server",
                 Server = "myserver",
                 database = "ABC",
                 UID = "sqladmin",
                 PWD = "pwd",
                 port = '14333')

data <- tbl(con, "abc")

abc is a table within the database ABC. The connection is successful (I am able to look at the tables and fields), but dplyr::tbl is returning a list of 2 instead of the table abc. So data is a list instead of a table. Where am I going wrong in this code?

The schema is ABC --> dbo --> abc

[image of data object]

Upvotes: 1

Views: 1508

Answers (2)

Simon.S.A.

Reputation: 6931

Building on @Konrad's answer, some additional considerations:

  • There is a distinction between a local dataframe and a remote dataframe. remote_data <- tbl(connection, "database_table_name") creates a remote dataframe. The data stays in the source database; R holds a pointer to it (the connection plus a query) that can be used to query the table.

  • You can load data from a remote dataframe into R memory using local_data <- collect(remote_data) or local_data <- as.data.frame(remote_data). Depending on the size of your remote data this can be very slow, or can crash R due to lack of memory.

  • Both the local and the remote dataframe behave like dataframes in most dplyr code. class(local_data) will include tbl_df (a tibble), while class(remote_data) will include tbl_sql and tbl_lazy alongside tbl. The remote dataframe is implemented as a list because it needs to store different information from the local dataframe. Try head(remote_data, 100) to view the first 100 rows of the remote table (see the sketch after this list).

  • Remote dataframes can be manipulated using (most) standard dplyr commands. These are translated by dbplyr into the corresponding database syntax and executed on the database.
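For illustration, a minimal sketch of the local/remote distinction, assuming the connection from the question (database_table_name is a placeholder, and the class vector shown is approximate for a SQL Server connection):

remote_data <- tbl(con, "database_table_name")
class(remote_data)
# e.g. "tbl_Microsoft SQL Server" "tbl_dbi" "tbl_sql" "tbl_lazy" "tbl"

local_data <- collect(remote_data)  # executes the query and pulls the rows into R memory
class(local_data)
# "tbl_df" "tbl" "data.frame"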

A good use of remote tables is to perform initial filtering and summarizing of a large table before pulling the summarized data into local R memory for further processing. For example:

library(DBI)
library(odbc)
library(dplyr)
library(dbplyr)
con <- dbConnect(odbc(), 
                 Driver = "SQL Server",
                 Server = "server_name",
                 database = "database_name",
                 UID = "sqladmin",
                 PWD = "pwd",
                 port = '14333')

remote_data <- tbl(con, "database_table_name")

# preview remote table
head(remote_data)

# filter and summarize on the database
prepared_data <- remote_data %>%
  filter(column_1 >= 10) %>%
  group_by(column_2) %>%
  summarize(total = sum(column_1), .groups = 'drop')

# check query of prepared table
show_query(prepared_data)

# draw summarised table into local memory
local_summarised_data <- collect(prepared_data)

Edit: Some additional points following @mykonos' question:

  • Storage of remote tables works differently from storage of local tables.

In R the command prepared_data <- local_table %>% mutate(new = 2 * old) creates a separate copy of the source data (it is a little more complex than this because of lazy evaluation under the hood, but this is a sufficient way to think of it). If you were removing objects from your workspace with rm you would have to remove both copies.

    However, remote tables are not duplicate copies of the data on the database. The command prepared_data <- remote_table %>% mutate(new = 2 * old) creates a second remote table in R. This means we have two remote table objects in R both pointing back to the same database table (but in different ways).

  • Remote table definitions in R consist of two components: the database connection and the query that produces the current table. When we manipulate a remote table in R, (by default) all we are doing is changing the query. You can use show_query to check the query that is currently defined.

    So, when we create remote_data <- tbl(con, "database_table_name") then remote_data is stored in R as the database connection and a query something like: SELECT * FROM database_table_name.

When we create prepared_data <- remote_table %>% mutate(new = 2 * old) then prepared_data is stored in R as the same database connection as remote_table but a different query. Something like: SELECT *, 2 * old AS new FROM database_table_name (see the sketch after this list).

  • Changing remote tables in R does not affect the database. Because manipulations of remote tables only change the query, working with database tables via dbplyr cannot change the source data. This is consistent with the design and intention of databases.

    If you want to write to a database from R, there are a range of ways to do it. There are a number of questions tagged dbplyr on SO that ask about this.

  • One downside of this approach is that lengthy manipulations of remote tables can perform poorly, because the database has to do significant work to produce the final result. I recommend you explore options to write temporary / intermediate tables (for example with compute()) if this is a concern, as in the sketch below.
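A minimal sketch of these points (column and table names are placeholders; the printed SQL is approximate and varies by dbplyr version and backend):

remote_data <- tbl(con, "database_table_name")
show_query(remote_data)
# roughly: SELECT * FROM "database_table_name"

prepared_data <- remote_data %>%
  mutate(new = 2 * old)   # only the stored query changes; the database table is untouched
show_query(prepared_data)
# roughly: SELECT *, 2.0 * "old" AS "new" FROM "database_table_name"

# materialize an intermediate result as a temporary database table, so later
# steps query the stored result instead of re-running the whole query
cached <- compute(prepared_data, name = "tmp_prepared", temporary = TRUE)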

Upvotes: 1

Konrad Rudolph

Reputation: 545588

The code works as expected. What you’re seeing is simply a limitation of the type display in the RStudio data inspector: the actual type returned by tbl is an object with S3 classes including tbl_dbi, tbl_sql and tbl_lazy, but it is implemented as a nested list (similar to how data.frames are implemented as lists of columns).

You will be able to work with your data as expected. You can also invoke as_tibble(data) to get a proper tibble back … but you don’t need to do that to work with it!
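For example (the class vector shown as a comment is approximate for a SQL Server connection):

class(data)
# e.g. "tbl_Microsoft SQL Server" "tbl_dbi" "tbl_sql" "tbl_lazy" "tbl"

head(data)                      # runs a small query and prints the first rows
local_data <- as_tibble(data)   # collects all rows into an ordinary local tibble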

Upvotes: 2
