Simple tips to make scaling your database easier as you grow

When you're working on a young project you're constantly making decisions that will make it either easier or harder to scale later. Sometimes it's good to pick the short-term gain, to accrue a bit of technical debt so you can ship faster. But other times we pick up technical debt because we didn't know there was an alternative.

I can say this because here at Honeybadger, we did a few things that made life much harder on us than it had to be. Had we understood a few key points scaling would have been much less painful.

Use UUIDs

When you say "primary key" most of us think of an auto-incrementing number. This works well for small systems, but it introduces a big problem as you scale.

At any given moment only one database server can generate the primary key. That means all writes have to go through a single server. That's bad news if you want to do thousands of writes per second.

Using UUIDs as primary keys sidesteps this problem. If you're not familiar with them, UUIDs are unique identifiers that look like this: 123e4567-e89b-12d3-a456-426655440000.

Here's how wikipedia describes them:

When generated according to the standard methods, UUIDs are for practical purposes unique, without requiring a central registration authority or coordination between the parties generating them. The probability that a UUID will be duplicated is not zero, but is so close to zero as to be negligible.

Thus, anyone can create a UUID and use it to identify something with near certainty that the identifier does not duplicate one that has already been created to identify something else, and will not be duplicated in the future. Information labeled with UUIDs by independent parties can therefore be later combined into a single database, or transmitted on the same channel, without needing to resolve conflicts between identifiers.

When you use UUIDs as primary keys, all writes no longer have to go through a single database. Instead you can spread them out across many servers.

In addition they give you the flexibility to do things like generate the id of a record before its saved to the database. This might be useful if you want to send the record to a cache or search server, but don't want to wait for the database transaction to complete.

Enabling UUIDs by default in Rails apps is easy. Just edit the config file:

# config/application.rb
config.active_record.primary_key = :uuid

You can use UUIDs for an individual table with Rails migrations:

create_table :users, id: :uuid do |t|
  t.string :name
end

It's a simple configuration option that, if you enable it when you start development, will save you a world of trouble when you try to scale. As an extra bonus, it will make it harder for bots and bad actors to guess your private URLs.

Counts and counters

Once you start looking, counts and counters are everywhere. Your email client displays the number of unread emails. Blogs have a pagination footer that uses the total number of posts to calculate the number of pages.

There are two scaling problems with counters:

Database queries like select count(*) from users are inherently slow. They literally loop through each record in the recordset to generate their results. If you have a million records, it's going to take a while.
Attempts to speed up counters by using "counter caches" do work, but they limit your ability to spread writes across many database servers.

The easiest solution is to avoid using counters wherever possible. This is much easier when you're doing the initial design.

For example, you might choose to page by date-range instead of by count. You might choose not to show a mildly-useful statistic that will be hellish to generate later. You get the idea.

Expiring & warehousing data

There is an upper limit to how much data you can store in a single postgres table given the amount of RAM, CPU and disk IO you have. That means there will come a point when you need to move stale data out of your main tables.

Let's look at a simple case. You want to delete records more than a year old in a database that is a few gigabytes large.

If you've never dealt with this issue before you might be tempted to do something like this:

MyRecords.where("created_at < ?", 1.year.ago).destroy

The problem is that this query will take days or weeks to run. Your database is just too big.

This is a particularly painful problem, because often you don't realize you have it until it's too late. Hardly anyone thinks about data purging strategies when their company is young and their database has 1,000 records.

If you do manage to plan ahead, there is an easy solution. Just partition your tables. Instead of writing all of your data to my_records you write this week's data to my_records_1 and next week's data to my_records_2. When it's time to delete last week's, just drop table my_records_1. Unlike the delete, this query completes very quickly.

It's possible to partition by fields other than date, too. Whatever makes sense for your use-case.

There's even a postgres extension named pg_partman that takes care of all the details and allows you to partition your database without changing a line of your code. Or, if you prefer to have partitioning managed in Ruby, there's a handy gem called partitionable

Parting Words

Next time you find yourself building a project from scratch, I'd encourage you to take a few moments to think about scaling. Don't obsess over it. Don't spend days worrying about whether to use HAML or ERB. But do ask yourself if there aren't any easy wins that you can pick up simply by planning ahead.

What to do next:

Try Honeybadger for FREE

Honeybadger helps you find and fix errors before your users can even report them. Get set up in minutes and check monitoring off your to-do list.
Start free trial
Easy 5-minute setup — No credit card required
Get the Honeybadger newsletter

Each month we share news, best practices, and stories from the DevOps & monitoring community—exclusively for developers like you.
Include latest Ruby articles

Starr Horne

Starr Horne is a Rubyist and Chief JavaScripter at Honeybadger.io. When she's not neck-deep in other people's bugs, she enjoys making furniture with traditional hand-tools, reading history and brewing beer in her garage in Seattle.

@starrhorne Author Twitter

More Ruby articles

Apr 09, 2024 Account-based subdomains in Rails
Mar 12, 2024 Let's build a Hanami app
Mar 05, 2024 How to deploy a Rails app to Render
Feb 12, 2024 Visualizing Ahoy analytics in Rails
Feb 07, 2024 Building reusable UI components in Rails with ViewComponent
Jan 17, 2024 Composite primary keys in Rails
Dec 14, 2023 Deploy a Rails app to a VPS with Kamal
Nov 20, 2023 How to build your own user authentication system in Rails
Nov 02, 2023 How to organize your code using Rails Concerns
Oct 16, 2023 FactoryBot for Rails testing

Stop wasting time manually checking logs for errors!

Try the only application health monitoring tool that allows you to track application errors, uptime, and cron jobs in one simple platform.

Know when critical errors occur, and which customers are affected.
Respond instantly when your systems go down.
Improve the health of your systems over time.
Fix problems before your customers can report them!

As developers ourselves, we hated wasting time tracking down errors—so we built the system we always wanted.

Honeybadger tracks everything you need and nothing you don't, creating one simple solution to keep your application running and error free so you can do what you do best—release new code. Try it free and see for yourself.

Start free trial

Simple 5-minute setup — No credit card required

Learn more

"We've looked at a lot of error management systems. Honeybadger is head and shoulders above the rest and somehow gets better with every new release."
— Michael Smith, Cofounder & CTO of YvesBlue

Honeybadger is trusted by top companies like:

“Everyone is in love with Honeybadger ... the UI is spot on.”

Molly Struve, Sr. Site Reliability Engineer, Netflix