Kasia Kulma

Reputation: 1732

Unable to download a complete db file

I'm trying to download a db file from a GitHub repo using the following code:

library(RSQLite)
library(curl)

url <- "https://github.com/kotartemiy/newscatcher/tree/master/newscatcher/data/package_rss.db"
curl::curl_download(url = url,
              destfile = "inst/external-data/package_rss.db",
              quiet = TRUE, mode = "wb")

This runs without error, but the downloaded file is only between 79 KB and 82 KB (depending on which mode I use). When I then try to open the database file, I get this warning:

sqlite.driver <- dbDriver("SQLite")
db <- dbConnect(sqlite.driver,
                dbname = "inst/external-data/package_rss.db")

Warning message:
Couldn't set synchronous mode: file is not a database
Use synchronous = NULL to turn off this warning.

Followed by the error:

dbListTables(db)

Error: file is not a database

The same thing happens with download.file() and different mode arguments. However, if I download the file manually, it is 376 KB and the RSQLite code works without any problems. What might be causing this? Thanks

Upvotes: 1

Views: 750

Answers (1)

r2evans

Reputation: 160982

As @27ϕ9 said, you're downloading a webpage, not the file it references.

url <- "https://github.com/kotartemiy/newscatcher/tree/master/newscatcher/data/package_rss.db"
download.file(url, "~/Downloads/package_rss.db")
# trying URL 'https://github.com/kotartemiy/newscatcher/tree/master/newscatcher/data/package_rss.db'
# Content type 'text/html; charset=utf-8' length unknown
# downloaded 82 KB

readLines("~/Downloads/package_rss.db", n=10)
#  [1] ""                                                                      
#  [2] ""                                                                      
#  [3] ""                                                                      
#  [4] ""                                                                      
#  [5] ""                                                                      
#  [6] "<!DOCTYPE html>"                                                       
#  [7] "<html lang=\"en\">"                                                    
#  [8] "  <head>"                                                              
#  [9] "    <meta charset=\"utf-8\">"                                          
# [10] "  <link rel=\"dns-prefetch\" href=\"https://github.githubassets.com\">"
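
A quicker check than reading the lines by eye is to inspect the first bytes of the file. A real SQLite 3 database starts with the 16-byte header "SQLite format 3" followed by a NUL byte. The sketch below is only an illustration: is_sqlite_file() is a hypothetical helper name, and the two temporary files stand in for a real download so the example runs without network access.

```r
# Hypothetical helper: TRUE if the file starts with the 16-byte SQLite 3 header.
is_sqlite_file <- function(path) {
  magic <- c(charToRaw("SQLite format 3"), as.raw(0))
  header <- readBin(path, what = "raw", n = length(magic))
  identical(header, magic)
}

# Stand-in for the HTML page that was saved under a .db name.
html_file <- tempfile(fileext = ".db")
writeLines(c("", "<!DOCTYPE html>", "<html lang=\"en\">"), html_file)

# Stand-in for a real database: only the header is valid, the rest is padding,
# so this file passes the header check but is not a usable database.
db_like_file <- tempfile(fileext = ".db")
writeBin(c(charToRaw("SQLite format 3"), as.raw(0), as.raw(rep(0, 84))),
         db_like_file)

is_sqlite_file(html_file)     # FALSE
is_sqlite_file(db_like_file)  # TRUE
```

Running a check like this right after the download would flag the problem before dbConnect() is ever called.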

If you go to that URL in a browser, you'll see two links on the page:

  1. The "Download" button points to a URL under raw.githubusercontent.com, so you can hunt for that URL.

  2. The "view raw" link takes the same URL you started with, replaces /tree/ with /blob/, and appends ?raw=true.


(While it's possible to harvest the html and get the link programmatically, I think just starting with the correct URL is the preferred route.)
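If you would rather derive the "view raw" form than type it by hand, the rewrite described in the second item can be sketched as below. The names github_raw_url() and download_github_file() are hypothetical helpers made up for this example, not part of any package, and the sketch assumes a plain github.com page URL with no existing query string other than ?raw=true.

```r
# Hypothetical helper: turn a github.com /tree/ or /blob/ page URL
# into the "view raw" form (/blob/ plus ?raw=true).
github_raw_url <- function(url) {
  url <- sub("/(tree|blob)/", "/blob/", url)
  if (!grepl("[?&]raw=true$", url)) {
    url <- paste0(url, "?raw=true")
  }
  url
}

# Hypothetical wrapper: download the rewritten URL as a binary file.
download_github_file <- function(url, destfile) {
  # mode = "wb" keeps binary files intact on Windows
  download.file(github_raw_url(url), destfile = destfile,
                mode = "wb", quiet = TRUE)
  invisible(destfile)
}

url <- "https://github.com/kotartemiy/newscatcher/tree/master/newscatcher/data/package_rss.db"
github_raw_url(url)
# [1] "https://github.com/kotartemiy/newscatcher/blob/master/newscatcher/data/package_rss.db?raw=true"

# Not run here (needs network access):
# download_github_file(url, "inst/external-data/package_rss.db")
```

After a real download, the file should be roughly the 376 KB seen with the manual download rather than about 80 KB of HTML.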

Upvotes: 2
