Reputation: 807
I am writing a new module in which I poll a couple of thousand records from Kafka every minute throughout the day, split each record across two tables, and then commit back to the Kafka broker. I intend to run aggregate queries on the few million records collected the previous day.
I am splitting the records into two tables because the payload is dynamic in nature and I am interested in only a few fields of the JSON payload. My assumption is that the entire row gets loaded into memory on the DB side while running a query, even though the aggregate only has to run on two columns, so I extract the columns responsible for the counts into a separate table from the start.
Customer_Count, wherein I run aggregate queries on counts per customer type per purchase type.
Customer_Payload, wherein I plan to just archive the full payload to an object store later.
I plan to do the batch inserts within one transactional block, first into the payload table and then into the counts table, assuming that a failure to insert any of the records into either table (because of an exception or an app or DB crash) causes the batch inserts to both tables to roll back. A sketch of what one batch reduces to at the SQL level is below.
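Roughly, each per-minute batch boils down to this (a sketch only; the real inserts are parameterized JdbcTemplate batch updates, and the literal values here are made up):

begin;

-- a couple of thousand payload rows per batch
insert into customer_payload (customer_id, payload, process_dt)
values ('cust-1', '{ "...": "..." }', current_date);

-- the matching count rows
insert into customer_counts (customer_id, count, purchase_type, process_dt)
values ('cust-1', 1, 'online', current_date);

commit; -- until this returns, neither table exposes any of the new rows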
Since I am writing a couple of thousand records per transaction into two tables, is there a possibility that a DB crash or an app crash while the commit is in progress leads to partial writes to one of the tables?
My assumption is that, as this is a synchronous transaction, any DB crash before the commit has gone through at the DB level causes the whole transaction to be rolled back.
Likewise for any crash in the Spring Boot application: the transaction will not be committed.
I am being extra cautious because these metrics feed revenue operations downstream, hence the question about the possibility of partial commits.
The tables look somewhat like this
Counts Table
create table customer_counts
(
    id bigserial primary key,
    customer_id text not null,
    count int,
    purchase_type text,
    process_dt date
);

create index metric_counts_idx on customer_counts (customer_id, purchase_type, process_dt);
Payload table
create table customer_payload
(
    id bigserial primary key,
    customer_id text not null,
    payload text,
    process_dt date
);

create index metric_payload_idx on customer_payload (customer_id, process_dt);
Then I run a
select sum(count), customer_id, purchase_type
from customer_counts
group by customer_id, purchase_type
on the counts table at the end of the day on a few million records.
I use the payload table only to select the rows and push them to an object store.
PS: I was also wondering whether creating an index on customer_id, purchase_type, count could save me the trouble of creating an extra table just for the counts. From what I read, however, indexes are only meant for lookups, the aggregate will run after loading the entire row, and you cannot guarantee that the query planner takes the index into account every time. Any suggestions on this approach would also help simplify my design from two tables to one, limiting the question about partial commits to a single table.
I plan to use the default settings in PostgreSQL for transactions and commits. We use Spring Boot's JdbcTemplate for DB access and the @Transactional annotation at the Java app level. The size of the payload varies between 0.5 KB and 10 KB. I also index on the customer_id, purchase_type, and date columns. The Postgres version is 9.6.
Upvotes: 0
Views: 241
Reputation: 14081
You will not see partially committed transactions. Nothing about your setup seems worrisome.
The "entire row" thing isn't quite right. PG actually loads things a page at a time, which usually means >1 row - but a page will only contain fairly compact row data, large values get compressed and stored out-of-band (aka TOAST). If you neither select nor filter on payload, you should not end up reading most of its field data.
As to your PS, I think this should actually be amenable to an index-only scan. AIUI, you would only ever be INSERTing and never UPDATEing/DELETEing, which should mean the vast majority of the table is visible to all transactions; that is the big factor in making index-only scans worth it. You would want a single index on customer_id, purchase_type and count, which could then be used to satisfy your final query, for example:
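(A sketch against the schema in the question; the index name is mine.)

create index customer_counts_agg_idx
    on customer_counts (customer_id, purchase_type, count);

-- once VACUUM has marked most pages all-visible, this can run as an
-- index-only scan:
select customer_id, purchase_type, sum(count)
from customer_counts
group by customer_id, purchase_type;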
Upvotes: 2