Reputation: 33928
I would like to write CSV data directly from a bytes (or string) object in memory to duckdb database file (i.e. I want to avoid having to write and read the temporary .csv files). This is what I've got so far:
import io
import duckdb
data = b'a,b,c\n0,1,2\n3,4,5'
rawtbl = duckdb.read_csv(
io.BytesIO(data), header=True, sep=","
)
con = duckdb.connect('some.db')
con.sql('CREATE TABLE foo AS SELECT * FROM rawtbl')
which throws following exception:
---------------------------------------------------------------------------
IOException Traceback (most recent call last)
Cell In[1], line 10
5 rawtbl = duckdb.read_csv(
6 io.BytesIO(data), header=True, sep=","
7 )
9 con = duckdb.connect('some.db')
---> 10 con.sql('CREATE TABLE foo AS SELECT * FROM rawtbl')
IOException: IO Error: No files found that match the pattern "DUCKDB_INTERNAL_OBJECTSTORE://2843be5a66472f9c"
However, it is possible to do:
>>> duckdb.sql('CREATE TABLE foo AS SELECT * FROM rawtbl')
>>> duckdb.sql('show tables')
┌─────────┐
│ name │
│ varchar │
├─────────┤
│ foo │
└─────────┘
>>> duckdb.sql('SELECT * from foo')
┌───────┬───────┬───────┐
│ a │ b │ c │
│ int64 │ int64 │ int64 │
├───────┼───────┼───────┤
│ 0 │ 1 │ 2 │
│ 3 │ 4 │ 5 │
└───────┴───────┴───────┘
since rawtbl
is a duckdb.duckdb.DuckDBPyRelation
object. But that is the in-memory duckdb database, not the 'some.db' file.
How to read csv data directly from bytes (or a string) to duckdb database file, without using intermediate CSV files?
duckdb 0.10.2 on Python 3.12.2 on Ubuntu
Upvotes: 3
Views: 821
Reputation: 13427
You can create the connection before you use read_csv
and pass the connection into it.
import io
import duckdb
from pathlib import Path
data = b'a,b,c\n0,1,2\n3,4,5'
db_path = 'some.db'
Path(db_path).unlink(missing_ok=True)
with duckdb.connect(db_path) as con:
rawtbl = duckdb.read_csv(
io.BytesIO(data), header=True, sep=",", connection=con,
)
con.execute('''
CREATE TABLE foo as select * from rawtbl
''')
with duckdb.connect(db_path) as con:
res = con.sql('select * from foo')
print(res)
# ┌───────┬───────┬───────┐
# │ a │ b │ c │
# │ int64 │ int64 │ int64 │
# ├───────┼───────┼───────┤
# │ 0 │ 1 │ 2 │
# │ 3 │ 4 │ 5 │
# └───────┴───────┴───────┘
Upvotes: 5
Reputation: 21580
I'm not entirely sure on best practices, but I did manage to get it to work with ATTACH.
duckdb.sql("attach 'some.db'")
duckdb_databases()
gave me the names.
duckdb.sql("from duckdb_databases()")
┌───────────────┬──────────────┬─────────┬───┬──────────┬─────────┬──────────┐
│ database_name │ database_oid │ path │ … │ internal │ type │ readonly │
│ varchar │ int64 │ varchar │ │ boolean │ varchar │ boolean │
├───────────────┼──────────────┼─────────┼───┼──────────┼─────────┼──────────┤
│ memory │ 1080 │ NULL │ … │ false │ duckdb │ false │
│ some │ 1489 │ some.db │ … │ false │ duckdb │ false │
│ system │ 0 │ NULL │ … │ true │ duckdb │ false │
│ temp │ 1479 │ NULL │ … │ true │ duckdb │ false │
├───────────────┴──────────────┴─────────┴───┴──────────┴─────────┴──────────┤
│ 4 rows 7 columns (6 shown) │
└────────────────────────────────────────────────────────────────────────────┘
We can then USE the file-backed db and select the data from rawtbl.
duckdb.sql("""
use "some";
create table foo as (from rawtbl)
""")
Check the result:
con = duckdb.connect("some.db")
con.sql("from foo")
┌───────┬───────┬───────┐
│ a │ b │ c │
│ int64 │ int64 │ int64 │
├───────┼───────┼───────┤
│ 0 │ 1 │ 2 │
│ 3 │ 4 │ 5 │
└───────┴───────┴───────┘
Upvotes: 1