Reputation: 51
I'm getting an error when reading a large table into R from SQL Server.
Here is my connection code:
library(odbc)
library(DBI)
con <- dbConnect(odbc::odbc(),
.connection_string = 'driver={SQL Server};server=DW01;database=SFAF_DW;trusted_connection=true')
My table has 149 variables. Here is my query:
data1 <- dbGetQuery(con, "SELECT * FROM [eCW].[Visits]")
This code throws an error, probably because of the table's size.
I would like to reduce the number of observations by filtering on the "VisitDateTime" variable.
data2 <- dbGetQuery(con, "SELECT cast(VisitDateTime as DATETIME) as VisitDateTime FROM [eCW].[Visits] WHERE VisitDateTime>='2019-07-01 00:00:00' AND VisitDateTime<='2020-06-30 12:00:00'")
This code selects only the "VisitDateTime" variable, but I would like to get all 149 variables from the table.
Hoping for some efficient code. I greatly appreciate your help on this. Thank you.
Upvotes: 1
Views: 537
Reputation: 107567
According to your schema, you have many variable-length types (varchar) of 255-character lengths. As multiple answers on the similar error post suggest, you cannot rely on the arbitrary order of SELECT * but must explicitly reference each column and place variable-length types toward the end of the SELECT clause. In fact, in application code running SQL, you should generally avoid SELECT * FROM. See Why is SELECT * considered harmful?
Fortunately, from your schema output using INFORMATION_SCHEMA.COLUMNS, you can dynamically build such a large named column list for the SELECT clause. First, adjust and run your schema query as an R data frame with a calculated column that orders types from smallest to largest by their precision/length.
schema_sql <- "SELECT sub.TABLE_NAME, sub.COLUMN_NAME, sub.DATA_TYPE, sub.SELECT_TYPE_ORDER
, sub.CHARACTER_MAXIMUM_LENGTH, sub.CHARACTER_OCTET_LENGTH
, sub.NUMERIC_PRECISION, sub.NUMERIC_PRECISION_RADIX, sub.NUMERIC_SCALE
FROM
(SELECT TABLE_NAME, COLUMN_NAME, DATA_TYPE
, CHARACTER_MAXIMUM_LENGTH, CHARACTER_OCTET_LENGTH
, NUMERIC_PRECISION, NUMERIC_PRECISION_RADIX, NUMERIC_SCALE
, CASE DATA_TYPE
WHEN 'tinyint' THEN 1
WHEN 'smallint' THEN 2
WHEN 'int' THEN 3
WHEN 'bigint' THEN 4
WHEN 'date' THEN 5
WHEN 'datetime' THEN 6
WHEN 'datetime2' THEN 7
WHEN 'decimal' THEN 8
WHEN 'varchar' THEN 9
WHEN 'nvarchar' THEN 10
ELSE 11 -- FALLBACK SO UNMAPPED TYPES DO NOT SORT FIRST AS NULL
END AS SELECT_TYPE_ORDER
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'eCW'
AND TABLE_NAME = 'Visits'
) sub
ORDER BY sub.SELECT_TYPE_ORDER
, sub.NUMERIC_PRECISION
, sub.NUMERIC_PRECISION_RADIX
, sub.NUMERIC_SCALE
, sub.CHARACTER_MAXIMUM_LENGTH
, sub.CHARACTER_OCTET_LENGTH"
visits_schema_df <- dbGetQuery(con, schema_sql)
# BUILD COLUMN LIST FOR SELECT CLAUSE
select_columns <- paste0("[", paste(visits_schema_df$COLUMN_NAME, collapse="], ["), "]")
# RUN QUERY WITH EXPLICIT COLUMNS
data <- dbGetQuery(con, paste("SELECT", select_columns, "FROM [eCW].[Visits]"))
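To illustrate how the bracketed column list is assembled, here is a minimal sketch with hypothetical column names standing in for the schema query result:

```r
# Hypothetical column names; in practice these come from
# visits_schema_df$COLUMN_NAME, already sorted by type order
col_names <- c("DW_Id", "VisitDateTime", "LastHIVTestResult")

# Wrap each name in square brackets and join with commas
select_columns <- paste0("[", paste(col_names, collapse = "], ["), "]")

select_columns
# "[DW_Id], [VisitDateTime], [LastHIVTestResult]"
```

The brackets guard against column names that clash with reserved words or contain spaces.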
The above may need adjustment if the same error arises. Be proactive and test on your end by isolating the problem columns, column types, etc. A few suggestions include filtering on DATA_TYPE or COLUMN_NAME, or rearranging the ORDER BY columns in the schema query.
...
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'eCW'
AND TABLE_NAME = 'Visits'
AND DATA_TYPE IN ('tinyint', 'smallint', 'int') -- TEST WITH ONLY INTEGER TYPES
...
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'eCW'
AND TABLE_NAME = 'Visits'
AND NOT DATA_TYPE IN ('varchar', 'nvarchar') -- TEST WITHOUT VARIABLE STRING TYPES
...
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'eCW'
AND TABLE_NAME = 'Visits'
AND NOT DATA_TYPE IN ('decimal', 'datetime2') -- TEST WITHOUT HIGH PRECISION TYPES
...
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_SCHEMA = 'eCW'
AND TABLE_NAME = 'Visits'
AND NOT COLUMN_NAME IN ('LastHIVTestResult') -- TEST WITHOUT LARGE VARCHARs
...
ORDER BY sub.SELECT_TYPE_ORDER -- ADJUST ORDERING
, sub.NUMERIC_SCALE
, sub.NUMERIC_PRECISION
, sub.NUMERIC_PRECISION_RADIX
, sub.CHARACTER_OCTET_LENGTH
, sub.CHARACTER_MAXIMUM_LENGTH
Still another solution is to stitch the R data frames together by type (adjusting the schema query for each) using a chained merge on the primary key (assumed to be DW_Id):
final_data <- Reduce(function(x, y) merge(x, y, by="DW_Id"),
list(data_int_columns, # SEPARATE QUERY RESULT WITH DW_Id AND INTs IN SELECT
data_num_columns, # SEPARATE QUERY RESULT WITH DW_Id AND DECIMALs IN SELECT
data_dt_columns, # SEPARATE QUERY RESULT WITH DW_Id AND DATE/TIMEs IN SELECT
data_char_columns) # SEPARATE QUERY RESULT WITH DW_Id AND VARCHARs IN SELECT
)
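As a toy illustration of the chained merge, with small stand-in data frames in place of the per-type query results (all column names here are hypothetical):

```r
# Stand-ins for the per-type query results; each must carry the key DW_Id
data_int_columns  <- data.frame(DW_Id = 1:3, VisitCount = c(2L, 5L, 1L))
data_char_columns <- data.frame(DW_Id = 1:3, ClinicName = c("A", "B", "C"))

# Reduce() applies merge() pairwise down the list, joining on DW_Id
final_data <- Reduce(function(x, y) merge(x, y, by = "DW_Id"),
                     list(data_int_columns, data_char_columns))

final_data
#   DW_Id VisitCount ClinicName
# 1     1          2          A
# 2     2          5          B
# 3     3          1          C
```

Each per-type query stays small enough to avoid the driver error, and the merge reassembles the full 149-variable table in R.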
Upvotes: 2