Reputation: 2091
I have an SQLite database (110kb) in an S3 bucket. I want to connect to that database every time I run my Python application.
One option is to download the database every time I run the Python application and connect to it. But I want to know if there is a way to connect to that SQLite database through memory, using S3FileSystem and open.
I'm using the SQLite3 library in Python 3.6.
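For reference, the download-first approach I'd like to avoid looks roughly like this (bucket and key names are placeholders):

import sqlite3
import tempfile
import boto3

s3 = boto3.client('s3')

# Download the whole database to a local temporary file, then connect to it.
with tempfile.NamedTemporaryFile(suffix='.sqlite', delete=False) as tmp:
    s3.download_fileobj('my-bucket', 'my-db.sqlite', tmp)

connection = sqlite3.connect(tmp.name)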
Upvotes: 13
Views: 19368
Reputation: 3884
Other answers explain how the sqlite engine can register an alternative "file system" that can work with database files in various cloud storage scenarios. This includes both custom solutions and updated cloud platform solutions. If these systems satisfy your need, by all means go for it.
These possibilities forced me to rethink my previous definite "No" answer, but they ultimately do not change the underlying nature of the sqlite engine. It is still true that the engine has no built-in network/web API and is not designed for that. That is not my opinion; the official website says as much. It is also still true that, even with an alternative file system, the engine works with database "files", so persisting and loading in-memory databases still requires a "file system". So for most applications and most developers who want an optimal and/or plug-and-play solution that matches the design goals of sqlite, my original answer stands.
No, it is not possible to connect directly to an sqlite database stored in the cloud. Even if you want to host the database in memory, it must still be downloaded completely first: an in-memory database has to be loaded from a disc-based file or created directly in memory with DDL commands. See Loading and Saving In-Memory Databases.
In such a scenario, once the database is disconnected, it would then need to be re-uploaded to the cloud storage. S3FileSystem.open just returns a data stream. All the stream will allow you to do is download the file to a local store so that it can be opened and manipulated locally.
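A rough sketch of that workflow, assuming s3fs and placeholder bucket/key names (the copy into memory uses sqlite3.Connection.backup, which needs Python 3.7+):

import sqlite3
import tempfile
import s3fs

fs = s3fs.S3FileSystem()

# The stream only lets us download the whole file to local storage first.
with tempfile.NamedTemporaryFile(suffix='.sqlite', delete=False) as tmp:
    with fs.open('my-bucket/my-db.sqlite', 'rb') as remote:  # placeholder path
        tmp.write(remote.read())

# Connect to the local copy, then optionally copy it into an in-memory database.
disk_conn = sqlite3.connect(tmp.name)
mem_conn = sqlite3.connect(':memory:')
disk_conn.backup(mem_conn)  # Python 3.7+
disk_conn.close()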
If you really need a cloud database, you need to look into another sort of hosted database.
Upvotes: 10
Reputation: 27052
(Inspired by dacort's answer)
If the database is just used for reads, there is https://github.com/michalc/sqlite-s3-query (full disclosure: written by me).
Modifying an example from its README, say you have my-db.sqlite in a bucket my-bucket in eu-west-2 (and credentials in environment variables):
from sqlite_s3_query import sqlite_s3_query

with \
        sqlite_s3_query(url='https://my-bucket.s3.eu-west-2.amazonaws.com/my-db.sqlite') as query, \
        query('SELECT * FROM my_table WHERE my_column = ?', params=('my-value',)) as (columns, rows):

    for row in rows:
        print(row)
Upvotes: 2
Reputation: 2385
Yes, it's possible with EFS:
https://www.lambrospetrou.com/articles/aws-lambda-and-sqlite-over-efs/
AWS recently released integration between AWS Lambda and Amazon EFS. It supports NFSv4 lock upgrading/downgrading, which is needed by SQLite. This means the SQLite engine can have read/write access to files stored on an EFS filesystem.
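For example, a Lambda function with an EFS access point mounted could use the standard sqlite3 module directly (the /mnt/efs mount path and database name below are assumptions):

import sqlite3

# Assumes the function's EFS access point is configured to mount at /mnt/efs.
DB_PATH = '/mnt/efs/my-db.sqlite'

def handler(event, context):
    conn = sqlite3.connect(DB_PATH)
    try:
        # Both reads and writes work because EFS provides the locking SQLite needs.
        conn.execute('CREATE TABLE IF NOT EXISTS hits (ts TEXT DEFAULT CURRENT_TIMESTAMP)')
        conn.execute('INSERT INTO hits DEFAULT VALUES')
        conn.commit()
        (count,) = conn.execute('SELECT COUNT(*) FROM hits').fetchone()
        return {'hits': count}
    finally:
        conn.close()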
Upvotes: 10
Reputation: 1
If all your actions are limited to reading from SQLite, I guess it is possible, but I have no idea if writing is also possible. In my case, I'm using GDAL (requires libgdal), and GDAL's /vsis3 and /vsis3_streaming (based on /vsicurl) give you the ability to read SQLite and many other data sources from the cloud. If you want to use raw SQLite rather than going through GDAL's data source layer, you can write the data to a local database via GDAL's API; but if so, why not just download the file and read it?
For me, since I'm working with spatial data and GDAL's DataSource provides a lot of APIs to manipulate spatial data, this method works fine. I'm still looking for a good way to write to a cloud-based SQLite database.
FYI, here's the documentation for GDAL's virtual file systems: https://gdal.org/user/virtual_file_systems.html
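A minimal read-only sketch using GDAL's Python bindings (the bucket and file names are assumptions, and GDAL must be built with curl support and have AWS credentials configured):

from osgeo import ogr

# Open a SQLite data source directly from S3 via the /vsis3 virtual filesystem.
ds = ogr.Open('/vsis3/my-bucket/my-db.sqlite')  # placeholder bucket/key
layer = ds.GetLayer(0)
for feature in layer:
    print(feature.ExportToJson())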
Upvotes: 0
Reputation: 873
As other answers indicate, you probably don't want to use SQLite as a primary database in the cloud.
However, as part of a fun side project I wrote an Amazon Athena data source connector that allows you to query SQLite databases in S3 from Athena. In order to do that, I wrote a read-only SQLite interface for S3.
SQLite has a concept of an OS Interface or VFS. Using a Python SQLite wrapper called APSW, you can write a VFS implementation for arbitrary filesystems. This is what I did in my project and I've included the implementation below.
In order to use this, you would first register the VFS and then create a new SQLite connection with this implementation as the driver.
I should note this isn't optimized at all, so it will likely still require reading full databases from S3 depending on your queries. But that doesn't sound like an issue in this specific case.
S3FS = S3VFS()  # S3VFS defined below

# This odd format is used due to SQLite requirements
sqlite_uri = "file:/{}/{}.sqlite?bucket={}&immutable=1".format(
    S3_PREFIX,
    DATABASE_NAME,
    S3_BUCKET
)

connection = apsw.Connection(
    sqlite_uri,
    flags=apsw.SQLITE_OPEN_READONLY | apsw.SQLITE_OPEN_URI,
    vfs=S3FS.vfsname
)
cursor = connection.cursor()
Once you have the cursor, you can execute standard SQL statements like so:
for x, y, z in cursor.execute("select x,y,z from foo"):
    print(cursor.getdescription())  # shows column names and declared types
    print(x, y, z)
VFS Implementation (requires APSW library and boto3 for S3 connectivity)
import apsw
import sys
import boto3

VFS_S3_CLIENT = boto3.client('s3')


class S3VFS(apsw.VFS):
    def __init__(self, vfsname="s3", basevfs=""):
        self.vfsname = vfsname
        self.basevfs = basevfs
        apsw.VFS.__init__(self, self.vfsname, self.basevfs)

    def xOpen(self, name, flags):
        return S3VFSFile(self.basevfs, name, flags)


class S3VFSFile():
    def __init__(self, inheritfromvfsname, filename, flags):
        # The bucket comes from the URI query string; the key is the URI path.
        self.bucket = filename.uri_parameter("bucket")
        self.key = filename.filename().lstrip("/")
        print("Initiated S3 VFS for file: {}".format(self._get_s3_url()))

    def xRead(self, amount, offset):
        # Fetch only the requested byte range from S3.
        # HTTP ranges are inclusive, so the last byte is offset + amount - 1.
        response = VFS_S3_CLIENT.get_object(
            Bucket=self.bucket,
            Key=self.key,
            Range='bytes={}-{}'.format(offset, offset + amount - 1)
        )
        response_data = response['Body'].read()
        return response_data

    def xFileSize(self):
        client = boto3.client('s3')
        response = client.head_object(Bucket=self.bucket, Key=self.key)
        return response['ContentLength']

    def xClose(self):
        pass

    def xFileControl(self, op, ptr):
        return False

    def _get_s3_url(self):
        return "s3://{}/{}".format(self.bucket, self.key)
Upvotes: 21