Pass Primary Key and Order By as parameters while dynamically creating a table in ClickHouse using PySpark

Question

I am trying to dynamically create a table in ClickHouse database from Pyspark using MergeTree() Engine. For creating a table in ClickHouse database, I need to pass parameters like ENGINE, PRIMARY KEY or ORDER BY.

So, far I was able to pass Engine parameter in the below code, However I am not sure how to pass other parameters like PRIMARY KEY or ORDER BY.

This is the code snippet I am using :

dataframe.write \
    .option("driver", self.options_params["driver"]) \
    .option("createTableColumnTypes", "id INTEGER, name CHAR(30), designation CHAR(15)") \
    .option("createTableOptions", "primary key (id)") \
    .option("createTableOptions", "order by (id)") \
    .option("createTableOptions", "engine=MergeTree()") \
    .option("truncate", "true") \
    .mode("overwrite") \
    .jdbc(self.options_params["url"], destination_table,
          properties={"user": self.options_params["user"], "password": self.options_params["password"]})

I checked the official documentation of Spark for this, but couldn't find how exactly I need to pass these parameters.

Below is the error I am getting :

Caused by: com.github.housepower.exception.ClickHouseSQLException: DB::ExceptionDB::Exception: ORDER BY or PRIMARY KEY clause is missing. Consider using extended storage definition syntax with ORDER BY or PRIMARY KEY clause. With deprecated old syntax (highly not recommended) storage MergeTree requires 3 to 4 parameters: 
name of column with date,
[sampling element of primary key],
primary key expression,
index granularity

Syntax for the MergeTree table engine:

CREATE TABLE [IF NOT EXISTS] [db.]table_name [ON CLUSTER cluster]
(
    name1 [type1] [DEFAULT|MATERIALIZED|ALIAS expr1] [TTL expr1],
    name2 [type2] [DEFAULT|MATERIALIZED|ALIAS expr2] [TTL expr2],
    ...
    INDEX index_name1 expr1 TYPE type1(...) GRANULARITY value1,
    INDEX index_name2 expr2 TYPE type2(...) GRANULARITY value2
) ENGINE = MergeTree()
ORDER BY expr
[PARTITION BY expr]
[PRIMARY KEY expr]
[SAMPLE BY expr]
[TTL expr [DELETE|TO DISK 'xxx'|TO VOLUME 'xxx'], ...]
[SETTINGS name=value, ...]

See details in documentation: https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/mergetree/. Other engines of the family support different syntax, see details in the corresponding documentation topics.

I couldn't get much info on the internet, does anybody have a workable solution for this ?Thanks in advance !

edit:

Solution

df.write \
            .format("jdbc") \
            .mode("overwrite") \
            .option("driver", "com.github.housepower.jdbc.ClickHouseDriver") \
            .option("url", self.parameters["connection_properties"]["url"]) \
            .option("createTableOptions", "engine=MergeTree()"
                                          "primary key (primary_col_1)"
                                          " order by (primary_col_1,toYYYYMM(createdon), toYYYYMM(modifiedon))"
                                          " partition by (partition_col_1)") \
            .option("user", "clickhouse_user") \
            .option("password", "clickhouse_user_password") \
            .option("dbtable", "clickhousedb.table_name") \
            .option("batchsize", "50000") \
            .option("truncate", "true") \
            .save()

user234397 · Accepted Answer

I added like this, and it works:

DF\
.repartition(60)\
.write.format("jdbc")\
.option("url", "jdbc:clickhouse://localhost:8123/raw_base")\
.option("driver", "ru.yandex.clickhouse.ClickHouseDriver")\
.option("dbtable", "raw_base.TABLE_NAME")\
.option("createTableOptions", "engine=MergeTree() order by (ID)")\
.option("user", "default") \
.option("password", "default")\
.mode('overwrite').save()

Pass Primary Key and Order By as parameters while dynamically creating a table in ClickHouse using PySpark

Answers (1)

Related Questions