DBhatta

Reputation: 51

How to read a large table (>100 columns/variables and 100,000 observations) from SQL Server into R using the odbc package

I'm getting an error when reading a large table from SQL Server into R.

Here is my connection code:

library(odbc)
library(DBI)
con <- dbConnect(odbc::odbc(), 
     .connection_string = 'driver={SQL Server};server=DW01;database=SFAF_DW;trusted_connection=true')

Here is the schema of my table, which has 149 variables:

[Screenshot: schema output from INFORMATION_SCHEMA.COLUMNS showing the table's column names, data types, and lengths]

data1 <- dbGetQuery(con, "SELECT * FROM [eCW].[Visits]")

This code throws an error, probably because of the table's size.

I would like to reduce the table (the number of observations) by filtering on the "VisitDateTime" variable.

data2 <- dbGetQuery(con, "SELECT cast(VisitDateTime as DATETIME) as VisitDateTime FROM [eCW].[Visits] WHERE VisitDateTime>='2019-07-01 00:00:00' AND VisitDateTime<='2020-06-30 12:00:00'")

This code selects only the "VisitDateTime" variable, but I would like to get all 149 variables from the table.

Hoping to get some efficient code. I greatly appreciate your help on this. Thank you.

Upvotes: 1

Views: 537

Answers (1)

Parfait

Reputation: 107567

According to your schema, you have many variable-length types (varchar, up to 255 characters). As multiple answers on the similar error post (likely the "Invalid Descriptor Index" issue) suggest, you cannot rely on the arbitrary column order of SELECT *; you must explicitly reference each column and place the variable-length ones toward the end of the SELECT clause. In fact, in application code running SQL, it is generally best to avoid SELECT * altogether. See Why is SELECT * considered harmful?

Fortunately, from your schema output using INFORMATION_SCHEMA.COLUMNS, you can build that long, explicit column list dynamically. First, adjust your schema query and read the result into an R data frame, adding a calculated column that orders the types from smallest to largest along with their precision/lengths.

schema_sql <- "SELECT sub.TABLE_NAME, sub.COLUMN_NAME, sub.DATA_TYPE, sub.SELECT_TYPE_ORDER
                    , sub.CHARACTER_MAXIMUM_LENGTH, sub.CHARACTER_OCTET_LENGTH
                    , sub.NUMERIC_PRECISION, sub.NUMERIC_PRECISION_RADIX, sub.NUMERIC_SCALE
               FROM 
                  (SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE 
                        , CHARACTER_MAXIMUM_LENGTH, CHARACTER_OCTET_LENGTH
                        , NUMERIC_PRECISION, NUMERIC_PRECISION_RADIX, NUMERIC_SCALE
                        , CASE DATA_TYPE
                                WHEN 'tinyint'   THEN 1
                                WHEN 'smallint'  THEN 2
                                WHEN 'int'       THEN 3
                                WHEN 'bigint'    THEN 4
                                WHEN 'date'      THEN 5
                                WHEN 'datetime'  THEN 6
                                WHEN 'datetime2' THEN 7
                                WHEN 'decimal'   THEN 8
                                WHEN 'varchar'   THEN 9
                                WHEN 'nvarchar'  THEN 10
                          END AS SELECT_TYPE_ORDER
                   FROM INFORMATION_SCHEMA.COLUMNS
                   WHERE TABLE_SCHEMA = 'eCW'
                     AND TABLE_NAME = 'Visits'
                  ) sub
               ORDER BY sub.SELECT_TYPE_ORDER
                      , sub.NUMERIC_PRECISION
                      , sub.NUMERIC_PRECISION_RADIX
                      , sub.NUMERIC_SCALE
                      , sub.CHARACTER_MAXIMUM_LENGTH
                      , sub.CHARACTER_OCTET_LENGTH"

visits_schema_df <- dbGetQuery(con, schema_sql)

# BUILD COLUMN LIST FOR SELECT CLAUSE
select_columns <- paste0("[", paste(visits_schema_df$COLUMN_NAME, collapse="], ["), "]")

# RUN QUERY WITH EXPLICIT COLUMNS
data <- dbGetQuery(con, paste("SELECT", select_columns, "FROM [eCW].[Visits]"))
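
Since you also want to restrict rows by VisitDateTime, the same generated column list can be combined with your date filter. A minimal sketch, using DBI parameter binding (the ? placeholders are supported by the odbc package; the bound date strings are taken from your question):

# RUN QUERY WITH EXPLICIT COLUMNS AND A PARAMETERIZED DATE-RANGE FILTER
# (SQL Server implicitly converts the bound strings to datetime)
data2 <- dbGetQuery(
    con,
    paste("SELECT", select_columns,
          "FROM [eCW].[Visits]",
          "WHERE VisitDateTime >= ? AND VisitDateTime <= ?"),
    params = list("2019-07-01 00:00:00", "2020-06-30 12:00:00")
)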

The above may still need adjustment if the same error arises. Be proactive and test on your end by isolating the problem columns, column types, etc. A few suggestions include filtering on DATA_TYPE or COLUMN_NAME, or re-arranging the ORDER BY columns in the schema query, as shown below.

...
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'eCW'
  AND TABLE_NAME = 'Visits'
  AND DATA_TYPE IN ('tinyint', 'smallint', 'int')  -- TEST WITH ONLY INTEGER TYPES
...
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'eCW'
  AND TABLE_NAME = 'Visits'
  AND NOT DATA_TYPE IN ('varchar', 'nvarchar')     -- TEST WITHOUT VARIABLE STRING TYPES
...
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'eCW'
  AND TABLE_NAME = 'Visits'
  AND NOT DATA_TYPE IN ('decimal', 'datetime2')    -- TEST WITHOUT HIGH PRECISION TYPES
...
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'eCW'
  AND TABLE_NAME = 'Visits'
  AND NOT COLUMN_NAME IN ('LastHIVTestResult')     -- TEST WITHOUT LARGE VARCHARs
...
ORDER BY sub.SELECT_TYPE_ORDER                         -- ADJUST ORDERING
       , sub.NUMERIC_SCALE                             
       , sub.NUMERIC_PRECISION
       , sub.NUMERIC_PRECISION_RADIX
       , sub.CHARACTER_OCTET_LENGTH
       , sub.CHARACTER_MAXIMUM_LENGTH
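
These isolation tests can also be run client-side against the schema data frame you already pulled, without re-running the schema query. A minimal sketch for the integer-only test (assuming visits_schema_df from above):

# FILTER THE SCHEMA DATA FRAME IN R INSTEAD OF RE-RUNNING THE SQL
int_schema_df <- subset(visits_schema_df,
                        DATA_TYPE %in% c("tinyint", "smallint", "int"))
select_int_columns <- paste0("[", paste(int_schema_df$COLUMN_NAME, collapse="], ["), "]")
data_int_columns <- dbGetQuery(con, paste("SELECT", select_int_columns, "FROM [eCW].[Visits]"))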

Still another solution is to stitch the final R data frame together from several smaller queries, one per group of column types (adjusting the schema query for each), using a chained merge on the primary key (assumed here to be DW_Id):

final_data <- Reduce(function(x, y) merge(x, y, by="DW_Id"),
                     list(data_int_columns,        # SEPARATE QUERY RESULT WITH DW_Id AND INTs IN SELECT
                          data_num_columns,        # SEPARATE QUERY RESULT WITH DW_Id AND DECIMALs IN SELECT 
                          data_dt_columns,         # SEPARATE QUERY RESULT WITH DW_Id AND DATE/TIMEs IN SELECT
                          data_char_columns)       # SEPARATE QUERY RESULT WITH DW_Id AND VARCHARs IN SELECT
              )
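
Each piece can be built like the integer-only sketch above; for type groups that do not contain the key, prepend DW_Id explicitly so the merge has a common column. A hedged sketch for the varchar piece (again assuming DW_Id is the primary key):

# VARCHAR-ONLY QUERY, PREPENDING THE ASSUMED PRIMARY KEY DW_Id FOR THE MERGE
char_schema_df <- subset(visits_schema_df,
                         DATA_TYPE %in% c("varchar", "nvarchar"))
select_char_columns <- paste0("[", paste(char_schema_df$COLUMN_NAME, collapse="], ["), "]")
data_char_columns <- dbGetQuery(con, paste("SELECT [DW_Id],", select_char_columns,
                                           "FROM [eCW].[Visits]"))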

Upvotes: 2
