Shmulik Asafi

Reputation: 203

Idempotent streams or preventing duplicate rows using PipelineDB

My application produces rotating log files containing multiple application metrics. The log file is rotated once a minute, but each file is still relatively large (over 30 MB, with hundreds of thousands of rows).

I'd like to feed the logs into PipelineDB (running on the same machine), whose continuous views can compute exactly the aggregations I need over the metrics.

I can easily ship the logs to PipelineDB using COPY FROM STDIN, which works great.
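For reference, a rough sketch of my setup looks like this (column and object names are simplified placeholders, and the syntax is standalone PipelineDB's):

    -- Stream carrying the fields of each log row (names here are made up)
    CREATE STREAM metrics_stream (serial bigint, metric text, value float8);

    -- A continuous view computing the aggregations I need over the metrics
    CREATE CONTINUOUS VIEW metrics_agg AS
        SELECT metric, count(*) AS cnt, sum(value) AS total
        FROM metrics_stream
        GROUP BY metric;

    -- Shipping one rotated file from psql; COPY into a stream works like COPY into a table
    \copy metrics_stream (serial, metric, value) FROM '/var/log/app/metrics-0001.csv' WITH (FORMAT csv)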

However, the machine might occasionally power off unexpectedly (e.g. due to a power outage) while a log file is being copied, which means that once it's back online there's no way to know how much of the file was actually inserted into PipelineDB.

How can I ensure that each row in my logs is inserted exactly once in such cases? (It's very important that the aggregations I get are complete and accurate.)

Note that each row in the log file has a unique identifier (a serial number created by my application), but I can't find any option in the docs to define a unique field on a stream. I assume PipelineDB's design simply isn't meant to handle unique fields on stream rows.

Nonetheless, are there any alternative solutions to this issue?

Upvotes: 1

Views: 120

Answers (1)

Derek Nelson

Reputation: 178

Exactly-once semantics in a streaming (infinite rows) context is a very complex problem. Most large PipelineDB deployments use some kind of message bus infrastructure (e.g. Kafka) in front of PipelineDB for delivery semantics and reliability, as that's not PipelineDB's core focus.

That being said, there are a couple of approaches you could use here that may be worth thinking about.

First, you could maintain a regular table in PipelineDB that keeps track of each logfile and the last line number that has been successfully written to PipelineDB. When you begin shipping a logfile, check it against this table to determine which line number to start at.
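A minimal sketch of that bookkeeping table, with placeholder names:

    -- Tracks how far each logfile has been shipped into PipelineDB
    CREATE TABLE shipped_logs (
        logfile   text PRIMARY KEY,
        last_line bigint NOT NULL DEFAULT 0
    );

    -- Before shipping, find out where to resume (no row means start from line 1)
    SELECT last_line FROM shipped_logs WHERE logfile = '/var/log/app/metrics-0001.csv';

    -- After each successfully copied batch, advance the high-water mark
    UPDATE shipped_logs SET last_line = 50000 WHERE logfile = '/var/log/app/metrics-0001.csv';

The first time a file is seen you'd INSERT its row instead, and how close this gets you to exactly-once depends on how atomically you can advance last_line relative to the COPY itself.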

Second, you could partition your aggregations by logfile (by including the file path or something similar in the grouping) and simply DELETE any existing rows for that logfile before re-sending it. Then use combine to aggregate over all logfiles at read time, possibly wrapped in a VIEW.
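Roughly, and again with made-up names, that could look like the following. It assumes the stream carries an extra logfile column that your shipping process adds to each row, and it uses standalone PipelineDB's CREATE CONTINUOUS VIEW and combine() syntax:

    -- Per-logfile aggregation; the logfile path is part of the grouping
    CREATE CONTINUOUS VIEW metrics_by_file AS
        SELECT logfile, metric, count(*) AS cnt, sum(value) AS total
        FROM metrics_stream
        GROUP BY logfile, metric;

    -- Before (re)sending a file, wipe whatever was previously aggregated for it.
    -- Depending on your PipelineDB version this may need to target the view's
    -- backing materialization table rather than the view itself.
    DELETE FROM metrics_by_file WHERE logfile = '/var/log/app/metrics-0001.csv';

    -- Read-time rollup across all logfiles using combine()
    CREATE VIEW metrics_all AS
        SELECT metric, combine(cnt) AS cnt, combine(total) AS total
        FROM metrics_by_file
        GROUP BY metric;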

Upvotes: 1
