Reputation: 113
I'm using Kafka Connect to stream a database with 250,000,000 records. Reading from the source was no issue; the data was read in a negligible timeframe and stored on the Kafka brokers. The sink side is extremely slow, though, and is going to take days to complete.
I'm using a JDBC Sink Connector with the following configuration:
"config": {
"auto.create": "true",
"auto.evolve": "true",
"connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
"insert.mode": "upsert",
"pk.fields": "id",
"pk.mode": "record_value",
"tasks.max": "20",
"batch.size": "5000",
"max.poll.records": "5000"
}
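A side note on the config above: as I understand it, a bare max.poll.records in the connector config is ignored by the sink's consumer, and a per-connector override normally needs the consumer.override. prefix (this assumes Kafka Connect 2.3+ and connector.client.config.override.policy=All on the worker, which I haven't verified). A minimal sketch under those assumptions:

"config": {
"connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
"insert.mode": "upsert",
"pk.mode": "record_value",
"pk.fields": "id",
"auto.create": "true",
"auto.evolve": "true",
"tasks.max": "20",
"batch.size": "5000",
"consumer.override.max.poll.records": "5000"
}

The alternative is setting consumer.max.poll.records in the worker properties, which applies to every connector on that worker. Also, as I understand it, batch.size only caps the per-put insert batch, so it is bounded by however many records each poll actually returns.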
I'm using an AWS RDS Postgres instance: db.m5.4xlarge (8 cores / 16 vCPUs, 64 GB RAM).
The sink table has no indices other than the primary key. RDS Performance Insights shows that DataFileRead and WALWriteLock waits account for the vast majority of the activity (DataFileRead ~60%, WALWriteLock ~30-40%).
I'm unfamiliar with database performance and administration. Is this typical performance? If not, what methods might I look into to increase performance? Is this simply a case of my instance not being beefy enough to handle the writes?
Here is an image of the database load; the growth of the green lines near the tail end is from when I dropped all indices on the affected sink tables.
Upvotes: 1
Views: 1251
Reputation: 113
I dropped all indices other than the primary key about 20 minutes prior to posting. This dramatically increased the processing speed, and several million rows have been processed since.
Upvotes: 1