Jabb
Jabb

Reputation: 3502

Pandas: How to insert dataframe into Clickhouse

I am trying to insert a Pandas dataframe into Clickhouse.

This is my code

import pandas
import sqlalchemy as sa

uri = 'clickhouse://default:@localhost/default'
ch_db = sa.create_engine(uri)

pdf = pandas.DataFrame.from_records([
    {'year': 1994, 'first_name': 'Vova'},
    {'year': 1995, 'first_name': 'Anja'},
    {'year': 1996, 'first_name': 'Vasja'},
    {'year': 1997, 'first_name': 'Petja'},
])

pdf.to_sql('test_humans', ch_db, if_exists='append', index=False)

And this is the error that I am receiving. Does this have to do with some missing extra arguments regarding the ENGINE? How can I fix this?

Exception: Code: 62, e.displayText() = DB::Exception: Syntax error: failed at position 65 (line 7, col 2): FORMAT TabSeparatedWithNamesAndTypes. Expected one of: ENGINE, storage definition (version 19.15.2.2 (official build))

Logging

INFO:sqlalchemy.engine.base.Engine:EXISTS TABLE test_humans INFO:sqlalchemy.engine.base.Engine:{} INFO:sqlalchemy.engine.base.Engine: CREATE TABLE test_humans ( first_name TEXT, year BIGINT )

INFO:sqlalchemy.engine.base.Engine:{} INFO:sqlalchemy.engine.base.Engine:ROLLBACK

Upvotes: 6

Views: 36920

Answers (4)

Danila Ganchar
Danila Ganchar

Reputation: 11242

You can do it without sqlalchemy.

pip install clickhouse-driver

from clickhouse_driver import Client


client = Client('localhost')
df = pandas.DataFrame.from_records([
    {'year': 1994, 'first_name': 'Vova'},
    {'year': 1995, 'first_name': 'Anja'},
    {'year': 1996, 'first_name': 'Vasja'},
    {'year': 1997, 'first_name': 'Petja'},
])

client.insert_dataframe(
    'INSERT INTO "your_table" (year, first_name) VALUES',
    df,
    settings=dict(use_numpy=True),
)

# or 
client.execute("INSERT INTO your_table VALUES", df.to_dict('records'))

Upvotes: 5

vladimir
vladimir

Reputation: 15226

sqlalchemy-clickhouse cannot create table automatically (at least until version 0.1.5.post0 inclusively) and interprets any sql-query of table creation as SELECT-query that always tail by FORMAT TabSeparatedWithNamesAndTypes-clause.

To work around this issue need to create a table manually using the infi.clickhouse_orm-api (this module delivered with sqlalchemy-clickhouse):

import pandas as pd
from infi.clickhouse_orm.engines import Memory
from infi.clickhouse_orm.fields import UInt16Field, StringField
from infi.clickhouse_orm.models import Model
from sqlalchemy import create_engine


# define the ClickHouse table schema
class Test_Humans(Model):
    year = UInt16Field()
    first_name = StringField()
    engine = Memory()


engine = create_engine('clickhouse://default:@localhost/test')

# create table manually
with engine.connect() as conn:
    conn.connection.create_table(Test_Humans) # https://github.com/Infinidat/infi.clickhouse_orm/blob/master/src/infi/clickhouse_orm/database.py#L142

pdf = pd.DataFrame.from_records([
    {'year': 1994, 'first_name': 'Vova'},
    {'year': 1995, 'first_name': 'Anja'},
    {'year': 1996, 'first_name': 'Vasja'},
    {'year': 1997, 'first_name': 'Petja'},
    # ! sqlalchemy-clickhouse ignores the last item so add fake one
    {}
])

pdf.to_sql('test_humans', engine, if_exists='append', index=False)

Take into account that sqlalchemy-clickhouse ignores the last item so add fake one (see source code and related issue 10).

Let's check this table in DB:

SELECT *
FROM test.test_humans

┌─year─┬─first_name─┐
│ 1994 │ Vova       │
│ 1995 │ Anja       │
│ 1996 │ Vasja      │
│ 1997 │ Petja      │
└──────┴────────────┘

4 rows in set. Elapsed: 0.003 sec.
*/

See the answer https://stackoverflow.com/a/68843676/303298 too.

Upvotes: 1

Alex
Alex

Reputation: 533

Starting with version 0.2.0 clickhouse_driver implements method insert_dataframe. See: https://clickhouse-driver.readthedocs.io/en/latest/api.html#clickhouse_driver.Client.insert_dataframe

Upvotes: 6

stifstyle
stifstyle

Reputation: 31

You can also use https://github.com/kszucs/pandahouse to insert your dataframe without extra conversions.

pip install pandahouse

import pandahouse as ph

pdf = pandas.DataFrame.from_records([
    {'year': 1994, 'first_name': 'Vova'},
    {'year': 1995, 'first_name': 'Anja'},
    {'year': 1996, 'first_name': 'Vasja'},
    {'year': 1997, 'first_name': 'Petja'},
])

connection = dict(database='default',
                  host='localhost',
                  user='default',
                  password='')

ph.to_clickhouse(pdf, 'test_humans', index=False, chunksize=100000, connection=connection)

Upvotes: 3

Related Questions