Our Postgres Infrastructure

Since I'm the one at Honeybadger primarily responsible for ops, and since we rely heavily on Postgres for everything we do, the GitLab incident struck close to home. We have fortunately never had a comparable failure at Honeybadger, but at a previous startup I did manage to wipe out the production database by mistake, so I know how it feels. Having read what happened at GitLab, and having just made some big changes to our infrastructure at Honeybadger, I thought now would be a good time to share how we run a sizeable Postgres installation. If nothing else, this will provide some additional documentation for Starr and Josh, should I ever get hit by a bus. :)

Backups & Disaster Recovery

When we first started Honeybadger we didn't have to worry much about scaling our database. The traffic was low enough that we just deployed a primary server and a backup server and used the default configuration options (with tuning by pg_tune). The only setup beyond installing the apt packages was configuring the replication. I set up streaming replication from the primary to the secondary, and I also set up wal-e on the primary to save the WAL segments to S3. This allowed the secondary to catch up with the primary from the WALs on S3 if the replication lag ever got so large that the WALs it needed were no longer available on the primary. It also allowed for disaster recovery in a separate datacenter, if necessary. In the worst-case scenario, we could spin up a new server in another datacenter, restore from the latest full backup generated by wal-e, then load the rest of the WALs to get the new Postgres server up to date. I later set up a hot standby in a separate datacenter using exactly this method, and used streaming replication to keep that server in sync along with the in-datacenter replica.
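To make the restore order concrete, here is a minimal Python sketch of the planning step: pick the newest base backup, then replay every archived WAL segment from that backup's starting point onward. This is an illustration only, not wal-e's API. The `BaseBackup` type, the `plan_restore` function, and the backup labels are hypothetical, and the sketch assumes all segments are on a single timeline, where the fixed-width hexadecimal segment names sort in the order they were written.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BaseBackup:
    label: str          # hypothetical label for the full backup
    start_segment: str  # first WAL segment needed on top of this backup


def plan_restore(backups, archived_segments):
    """Pick the newest base backup and the WAL segments to replay on top of it."""
    if not backups:
        raise ValueError("no base backup available")
    # On a single timeline, segment names compare in write order.
    latest = max(backups, key=lambda b: b.start_segment)
    to_replay = sorted(s for s in archived_segments if s >= latest.start_segment)
    return latest, to_replay


# Seven archived segments and two full backups (made-up example data).
segments = [f"00000001{0:08X}{n:08X}" for n in range(1, 8)]
backups = [
    BaseBackup("monday", segments[1]),
    BaseBackup("thursday", segments[4]),
]

chosen, replay = plan_restore(backups, segments)
print(chosen.label, len(replay))
```

With this data the plan uses the later backup and replays only the segments written since it started; everything older is already contained in the backup itself.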

Connection Pooling

As we added more customers and our workload increased, we added more Sidekiq workers to handle the load. There was nothing remarkable about this until we hit the maximum number of connections allowed in our Postgres configuration. Eventually we ended up allowing 1024 connections, and at that point we decided we needed to bring in a connection pooler to take the load off the database. I evaluated pgpool and pgbouncer, and pgbouncer ended up working better for us. I really wanted the failover benefits that pgpool offered, but pgbouncer proved more stable, so I delayed my dream of having automated database failover. Using pgbouncer in transaction mode (and setting prepared_statements: false in database.yml) greatly reduced the number of active connections to Postgres, and it has been rock solid.
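The prepared_statements setting matters because, in transaction mode, a client only holds a server connection for the duration of one transaction, while a prepared statement is state that lives on a particular server connection. The toy Python model below illustrates that mismatch. It is a sketch, not pgbouncer's implementation: the `ServerConnection` and `TransactionPool` classes are hypothetical, and the pool deliberately rotates connections to make the effect visible, whereas a real pooler simply gives no guarantee that a client gets the same server connection twice.

```python
from collections import deque


class ServerConnection:
    """Stand-in for one real Postgres backend connection."""

    def __init__(self, name):
        self.name = name
        self.prepared = set()  # session-level state lives on the server connection


class TransactionPool:
    """Toy transaction-mode pooler: a connection is held for one transaction only."""

    def __init__(self, size):
        self.idle = deque(ServerConnection(f"server-{i}") for i in range(size))

    def begin(self):
        if not self.idle:
            raise RuntimeError("pool exhausted; a real pooler would queue the client")
        return self.idle.popleft()

    def commit(self, conn):
        # Hand the connection back so any other client can use it next.
        self.idle.append(conn)


pool = TransactionPool(size=2)

# Transaction 1: the client prepares a statement on whichever connection it gets.
first = pool.begin()
first.prepared.add("find_user")
pool.commit(first)

# Transaction 2: the same client is handed a different server connection,
# where the statement was never prepared.
second = pool.begin()
statement_still_there = "find_user" in second.prepared
pool.commit(second)

print(first.name, second.name, statement_still_there)
```

The same sharing is what lets many mostly idle workers get by with a small number of real Postgres connections.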

High Availability

We recently moved from leasing bare metal servers to hosting everything on EC2. When I made this change, I knew it was time to stop pretending that servers don't die (since EC2 instances die all the time) and to come up with a database failover scenario that wouldn't involve one of us waking up at 3am. Achieving HA with a traditional relational database seems to be one of the eternal quests of operators, so it was with some trepidation that I once again set this goal for myself. I wasn't prepared to switch from pgbouncer to pgpool, so I looked for other options. Fortunately, in the time since I had last looked for a solution, two new, good candidates had arrived on the scene: Stolon and Patroni. After evaluating both, I opted for Patroni, and I got to work integrating it into our environment. It took a bit of head scratching to figure out how to get a Patroni-controlled Postgres instance to follow and fail over from a non-Patroni-controlled instance (our primary at the old datacenter), but I eventually got it, and it worked like a charm when it came time to do the cutover.

Failover

Patroni has high availability covered: if the leader Postgres instance dies, a leader election happens and one of the followers gets promoted to be the new leader. To handle failover, though, I had to find a way to get that change communicated to the pgbouncer instances. This task is handled by consul-template. Once the leader change is registered in Consul, the consul-template daemon running alongside each pgbouncer instance updates the pgbouncer configuration with the location of the new leader and reloads pgbouncer, which then relays database traffic to the new leader without breaking the database connection that the Rails application has with pgbouncer. Amazingly, it all seems to work. :)
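The render-then-reload step can be sketched in a few lines of Python. This is only a model of the idea, not consul-template itself and not the configuration described above: the function names, the database name `app_production`, and the IP addresses are all hypothetical. It renders a pgbouncer `[databases]` entry for the current leader, rewrites the file only when the leader actually changed, and reports whether a reload is needed.

```python
import os
import tempfile


def render_databases_section(db_name, leader_host, leader_port=5432):
    """Render a pgbouncer [databases] section pointing at the current leader."""
    return (
        "[databases]\n"
        f"{db_name} = host={leader_host} port={leader_port} dbname={db_name}\n"
    )


def write_if_changed(path, text):
    """Write text to path atomically; return True only when the content changed."""
    try:
        with open(path) as f:
            if f.read() == text:
                return False
    except FileNotFoundError:
        pass
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        f.write(text)
    os.replace(tmp_path, path)  # atomic rename: readers never see a partial file
    return True


def on_leader_change(path, db_name, leader_host):
    """Return True when pgbouncer needs a reload (the caller would trigger it)."""
    return write_if_changed(path, render_databases_section(db_name, leader_host))


with tempfile.TemporaryDirectory() as workdir:
    config_path = os.path.join(workdir, "pgbouncer.ini")
    first_write = on_leader_change(config_path, "app_production", "10.0.0.11")
    same_leader = on_leader_change(config_path, "app_production", "10.0.0.11")
    failover = on_leader_change(config_path, "app_production", "10.0.0.12")
    with open(config_path) as f:
        final_config = f.read()

print(first_write, same_leader, failover)
```

Skipping the write when nothing changed keeps pgbouncer from being reloaded on every poll, and because the application talks only to pgbouncer, its own connections are unaffected when the entry is repointed.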

Happy Servers, Happy Humans

It's been a lot of fun scaling up Honeybadger and making the infrastructure more resilient to failure. Kudos to all those who have created and contributed to the open source projects we use to make that happen!


Benjamin Curtis

Ben has been developing web apps and building startups since '99, and fell in love with Ruby and Rails in 2005. Before co-founding Honeybadger, he launched a couple of his own startups: Catch the Best, to help companies manage the hiring process, and RailsKits, to help Rails developers get a jump start on their projects. Ben's role at Honeybadger ranges from bare-metal to front-end... he keeps the server lights blinking happily, builds a lot of the back-end Rails code, and dips his toes into the front-end code from time to time. When he's not working, Ben likes to hang out with his wife and kids, ride his road bike, and of course hack on open source projects. :)
