Jeremy

Reputation: 3548

Invalid resource manager ID in primary checkpoint record

I've updated my Airbyte image from 0.35.2-alpha to 0.35.37-alpha. [running in Kubernetes]

When the system rolled out, the db pod wouldn't terminate and I [a terrible mistake] deleted the pod. When it came back up, I got this error -

PostgreSQL Database directory appears to contain a database; Skipping initialization

2022-02-24 20:19:44.065 UTC [1] LOG:  starting PostgreSQL 13.6 on x86_64-pc-linux-musl, compiled by gcc (Alpine 10.3.1_git20211027) 10.3.1 20211027, 64-bit
2022-02-24 20:19:44.065 UTC [1] LOG:  listening on IPv4 address "0.0.0.0", port 5432
2022-02-24 20:19:44.065 UTC [1] LOG:  listening on IPv6 address "::", port 5432
2022-02-24 20:19:44.071 UTC [1] LOG:  listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2022-02-24 20:19:44.079 UTC [21] LOG:  database system was shut down at 2022-02-24 20:12:55 UTC
2022-02-24 20:19:44.079 UTC [21] LOG:  invalid resource manager ID in primary checkpoint record
2022-02-24 20:19:44.079 UTC [21] PANIC:  could not locate a valid checkpoint record
2022-02-24 20:19:44.530 UTC [1] LOG:  startup process (PID 21) was terminated by signal 6: Aborted
2022-02-24 20:19:44.530 UTC [1] LOG:  aborting startup due to startup process failure
2022-02-24 20:19:44.566 UTC [1] LOG:  database system is shut down

Pretty sure the WAL file is corrupted, but I'm not sure how to fix this.

Upvotes: 4

Views: 9327

Answers (4)

help upback cloud

Reputation: 1

Another thing to consider is checking the PostgreSQL configuration for potential misalignments, like incorrect wal_level or checkpoint_timeout settings. Misconfigurations here can sometimes cause issues during recovery if checkpoints or WAL segments don’t align properly.
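
For reference, a quick way to inspect those two settings (the data path below matches the pgdata location used elsewhere in this thread, and the postgres username is an assumption):

    # Values may be commented out in the file if the defaults are in use
    # (path assumes PGDATA=/var/lib/postgresql/data/pgdata, as in this thread):
    grep -E '(wal_level|checkpoint_timeout)' /var/lib/postgresql/data/pgdata/postgresql.conf

    # On an instance that starts, the effective values are visible via psql
    # (-U postgres is an assumption; use your superuser):
    psql -U postgres -c "SHOW wal_level;" -c "SHOW checkpoint_timeout;"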

It’s also worth verifying that the storage layer (e.g., file system or RAID) isn’t introducing corruption. Silent disk errors can occasionally lead to problems like this.
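
A couple of starting points for that check, as a sketch (/dev/sda is a placeholder device, and smartctl needs the smartmontools package installed):

    # Scan kernel logs for disk-level I/O errors:
    dmesg | grep -iE 'i/o error|medium error'

    # SMART health summary (/dev/sda is a placeholder; adjust for your host):
    smartctl -H /dev/sda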

Upvotes: 0

jetperfect

Reputation: 1

Unfortunately, my system hit the same error this morning.

I was able to resolve it, and the database is operating stably again with no data loss detected.

Some suggestions to fix this error:

  1. Back up the data folder to a separate area to avoid loss (a minimal backup sketch follows this list).

  2. Override the entrypoint so the container stays up without starting postgres (this short-circuits postgres's automatic restart loop):

    services:
      database:
        image: "postgres:13.4-buster"
        entrypoint: ["tail", "-f", "/dev/null"]
        ...
    
  3. Open a shell in the container and run the following commands:

    docker exec -it $(docker ps -q -f "name=<container-name>") bash
    pg_resetwal --dry-run /var/lib/postgresql/data/pgdata
    pg_resetwal /var/lib/postgresql/data/pgdata

      The last command should print "Write-ahead log reset".

  4. If there are no problems, remove the entrypoint override from step 2 and restart the service.
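
A minimal sketch for the backup in step 1, assuming the same pgdata path as step 3 (run it inside the container, or against the mounted volume on the host):

    # Copy the data directory aside before resetting the WAL
    # (path assumed from step 3; adjust to your volume layout):
    cp -a /var/lib/postgresql/data/pgdata /var/lib/postgresql/data/pgdata.bak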

Good luck!

Upvotes: 0

Daniel Aguilera

Reputation: 71

The su command is messing with PATH, so the easiest solution is to use gosu to drop from root to postgres:

    gosu postgres pg_resetxlog /var/lib/postgresql/data

(Note that on PostgreSQL 10 and later the tool is named pg_resetwal.) Hopefully that works for you!

Upvotes: 0

Jeremy

Reputation: 3548

Warning - there is a potential for data loss

This is a test system, so I wasn't concerned with keeping the latest transactions, and had no backup.

First I overrode the container command to keep the container running without trying to start postgres.

...
    spec:
      containers:
        - name: airbyte-db-container
          image: airbyte/db
          command: ["sh"]
          args: ["-c", "while true; do echo $(date -u) >> /tmp/run.log; sleep 5; done"]
...

And spawned a shell on the pod -

kubectl exec -it -n airbyte airbyte-db-xxxx -- sh

Then ran pg_resetwal -

# dry-run first
pg_resetwal --dry-run /var/lib/postgresql/data/pgdata

Success!

pg_resetwal /var/lib/postgresql/data/pgdata
Write-ahead log reset

Then I removed the temporary command override from the container spec, and postgres started up correctly!
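
For anyone following along, one way to remove the override is to edit the deployment in place (the "airbyte-db" deployment name is an assumption based on the pod name above):

kubectl edit deployment -n airbyte airbyte-db
# Delete the temporary "command:" and "args:" lines, save, and the new pod
# starts with the image's normal postgres entrypoint.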

Upvotes: 7
