Amit

Reputation: 793

How can I retrieve data from Oracle into R faster than this?

Here is the data frame that I have:

trail_df= data.frame(d= seq.Date(as.Date("2020-01-01"), as.Date("2020-02-01"), by= 1),
                      AA= NA,
                      BB= NA,
                      CC= NA)

Now I loop over the columns of trail_df and, for each column name, fetch that column's data for the given dates from the Oracle database, which I am doing like this:

for (i in 2:ncol(trail_df)) {
  c_name <- colnames(trail_df)[i]
  query  <- paste0("SELECT * FROM tablename WHERE ID = '", c_name, "'")  # this query returns Date and Price
  result <- dbGetQuery(con, query)     # con is the DBI connection to the database

  for (k in seq_len(nrow(result))) {
      # match the date in trail_df and place the value in the respective column
      trail_df[which(as.Date(result[k, 1]) == as.Date(trail_df[, 1])), i] <- result[k, 2]
  }
}
           

This is a snippet of the code; the date filtering and related details are handled in the real code.

The problem is that I have more than 6,000 columns and 500 rows, and because the dates are random I have to match them one by one and place the price in the respective column, which is currently taking forever.

I am new to the R language and would appreciate any help to speed this code up, perhaps with multiprocessing if that is possible in R.

Upvotes: 1

Views: 343

Answers (1)

r2evans

Reputation: 161110

There are two steps to this answer:

  1. Use parameterized queries to get the raw data; and
  2. Get this data into the "wide" format you desire.

Parameterized query

My (first) suggestion is to use parameterized queries, which is safer. It may not improve speed relative to @RonakShah's answer (which uses sprintf), at least not the first time the query runs.

However, it might help a touch if the query is repeated: DBMSes tend to parse/optimize queries and cache this optimization. When a query changes even a little, this caching cannot happen, and the query is re-optimized. In this case, this cache-invalidation is unnecessary, and can be avoided if we use binding parameters.

query <- sprintf("SELECT * FROM tablename WHERE ID IN (%s)",
                 paste(rep("?", ncol(trail_df[-1])), collapse = ","))
query
# [1] "SELECT * FROM tablename WHERE ID IN (?,?,?)"
res <- dbGetQuery(con, query, params = as.list(colnames(trail_df)[-1]))  # one value per '?' placeholder
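
If the same lookup will be run repeatedly (for example, each day with fresh parameters), DBI's prepared-statement interface makes the reuse explicit, which is where the caching benefit described above pays off. A minimal sketch, assuming the same con and query as above and that your Oracle driver supports dbBind:

stmt <- dbSendQuery(con, query)                          # parsed/optimized once by the DBMS

dbBind(stmt, params = as.list(colnames(trail_df)[-1]))   # bind one value per '?' placeholder
res <- dbFetch(stmt)                                     # retrieve the rows for this binding

# ...rebind and refetch as often as needed, then release the statement
dbClearResult(stmt)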

Some thoughts:

  • if the database has many more dates than what you have here, you can reduce the data returned by restricting the date range queried. This will work well if your trail_df dates are close together:

    query <- sprintf("SELECT * FROM tablename WHERE ID IN (%s) and Date between ? and ?",
                     paste(rep("?", ncol(trail_df[-1])), collapse = ","))
    query
    res <- dbGetQuery(con, query,
                      params = c(as.list(colnames(trail_df)[-1]), as.list(range(trail_df$d))))
    
  • if your dates are more variable and you end up querying many more rows than you actually need, I suggest uploading your trail_df dates into a temporary table and then joining against it, something like:

    "select tb.Date, tb.ID, tb.Price
       from mytemptable tmp
       left join tablename tb on tmp.d = tb.Date
     where ..."
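
    For the upload step itself, dbWriteTable can push the dates into a scratch table before the join. A minimal sketch, assuming you have permission to create a table and that the name mytemptable is free to use (how temporary tables are handled varies by Oracle driver, so the temporary = TRUE argument may need adjusting):

    dbWriteTable(con, "mytemptable", data.frame(d = trail_df$d),
                 temporary = TRUE, overwrite = TRUE)   # push only the dates you need to the server

    res <- dbGetQuery(con,
      "select tb.Date, tb.ID, tb.Price
         from mytemptable tmp
         left join tablename tb on tmp.d = tb.Date")   # add further filters in the where clause as needed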
    

Reshape

It appears as if your database table may be more "long" shaped and you want it "wide" in your frame. There are many ways to reshape from long-to-wide (examples), but these should work:

reshape2::dcast(res, Date ~ ID, value.var = "Price") # 'Price' is the 'value' column, name unknown here

tidyr::pivot_wider(res, id_cols = "Date", names_from = "ID", values_from = "Price")
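
To make the long-to-wide step concrete, here is a toy example standing in for res; the Date/ID/Price column names are assumptions based on the query above:

res <- data.frame(
  Date  = as.Date(c("2020-01-01", "2020-01-01", "2020-01-02")),
  ID    = c("AA", "BB", "AA"),
  Price = c(10.5, 20.1, 11.2)
)
tidyr::pivot_wider(res, id_cols = "Date", names_from = "ID", values_from = "Price")
# # A tibble: 2 x 3
#   Date          AA    BB
#   <date>     <dbl> <dbl>
# 1 2020-01-01  10.5  20.1
# 2 2020-01-02  11.2    NA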

Upvotes: 1
