Working on an oldish Rails project, I came across some smelly ActiveRecord code that begged for some refactoring love. I also spent some time speeding up pages with slow/many database calls. Between those two experiences, I felt the inspiration to write-up some “Back to Basics” Rails Database Best Practices.
Rule #1: Let your Database do its Job
Databases are extremely feature rich and are really freakin fast when used properly. They’re great at filtering and sorting… and many other things. If the database can do it, it will do it way faster than doing the same thing in Ruby, or any other language for that matter.
You might have to learn a little bit about how DBs work, but honestly, you don’t have to go very deep to reap a majority of the benefits.
We generally use Postgres. What you choose is less important than getting to know it and using its features to make your product awesome. If you’re curious about Postgres, there are some good resources at the end of this post. We love it.
Our first, overarching rule: Let your database do what databases are good at, instead of doing it in Ruby.
Rule #2: Write efficient and chainable Scopes
Scopes are great. They allow you to create succinct helpers for accessing subsets of data that are relevant in specific situations. It’s too bad their usefulness can be significantly stifled by a couple of anti-patterns. Check out the
.active scope in this pared down example:
There are a few things going on here that should raise concern:
- The scope does not return an ActiveRecord Relation, therefore it is not chainable nor can it be used with .merge() (more on merge later).
- The scope is filtering a larger dataset down to a smaller one in Ruby.
- The scope is sorting in Ruby.
- The scope is sorting, period.
#1 Return an ActiveRecord::Relation (i.e. don’t trigger that query!)
Why is returning a Relation better? Relations are chainable. Chainable scopes are more easily reused. You can combine multiple scopes into a single query when each scope returns a Relation, e.g.
Athlete.active.men.older_than(40). Relations can also be used with
.merge(), which is the bee’s knees.
#2 Filter data in the database (not in Ruby)
Why is filtering in Ruby a bad idea? Because it’s really slow compared to filtering in the database. For small datasets (single and low double digits), it might not make a notable difference. For larger datasets, it can be huge.
Rule of Thumb: the less work Ruby has to do, the better.
Filtering in Ruby is slower because:
- Time is spent transporting the data from the database to the application server,
- ActiveRecord has to parse the query results and create AR model objects for each one, and
- Your database has indexes (right?!) which make filtering blazingly fast, Ruby does not.
#3 Sort in the database (not in Ruby)
And what about sorting? It’s going to be faster in the database, but unless you’re dealing with very large datasets, it’s not really going to be noticeable. The bigger issue with this one is that
.sort_by is going to trigger the query and therefore, we lose our Relation. That’s reason enough to do it in the database.
#4 Leave ordering out of scopes (or capture in a dedicated scope)
Since we’re trying to build reusable scopes, it’s probably unlikely that every single consumer of a scope is going to have the same ordering requirements. For that reason, I recommend leaving trivial ordering out of scopes all together. Or, move compound/non-trivial ordering to its own scope, like this:
So what would be better?
Check out the use of
.merge() (api) in the revised scope.
.merge() makes it easy to use scopes from other models that have been joined into the query, reducing potential duplication.
This version is effectively equivalent, but doesn’t have any of the drawbacks of the original. I think it’s easier to read too.
Rule #3: Reduce calls to the Database
ActiveRecord provides an easy API for doing many things with our database, but it also makes it pretty easy to do things inefficiently. The layer of abstraction hides what’s really happening.
One of those inefficiencies is a proliferation of database queries. The kicker is that during development we’re usually using a local database and small datasets. Once you push to production, suddenly the latency increases 10x (or more!) and the dataset is significantly larger. Requests get sssslllloooooowwwwww…
Best Practice: if an often visited page triggers more than a couple calls to the DB, it’s worth spending a little time to reduce the number of calls to just a few.
In many cases, this is just a matter of using
.joins(). Sometimes you’ll have to use
.having() and some aggregate functions. And in some rare cases, you might have to write some straight up SQL.
For a non-trivial query, start with the CLI. Once you have a query working in SQL, figure out how to shoehorn it into ActiveRecord. This way, you’re only wrestling with one thing at a time: first pure SQL, then ActiveRecord. On Postgres? Use pgcli instead of psql, it’s really nice.
A lot has been written about this specific topic. Here are a few good resources:
- Includes vs Joins in Rails
- Preload vs Eager Load vs Joins vs Includes
- Bullet – A gem that warns of N+1 select (and other) issues
What about Caching? Sure, caching is another way to speed up these slow pages, but it’s better to eliminate inefficient queries first. It improves the experience when there’s a cache miss and puts less load on your database in general, which is helpful as you scale.
Rule #4: Use Indexes
Databases can only do fast lookups for columns with indexes, otherwise it’s doing a sequential scan (i.e. bad news).
Rule of Thumb: Add an index on every id column as well as any column that is used in a where clause.
Adding them is easy. In a Rails migration:
Reality is a little more nuanced, as it always is. It’s possible to over-index and indexes do add some overhead on insert/update, but as a rule of thumb, it’s better to have them than not.
Want to understand what the database is doing when you trigger a query or update? You can always tack
.explain to the end of your ActiveRecord Relation and it will return the database’s query plan. See running explain for more info.
Rule #5: Use Query Objects for complicated Queries
I find that scopes are best used when they’re simple and don’t do too much. I think of them as reusable building blocks. If I need to do something more complicated, I use a Query class to encapsulate the potentially gnarly query. Here’s an example:
While it might look like this triggers multiple queries, it doesn’t. Lines 9-10 define Relations. They’re used on lines 15-16 and result in two subqueries. This is the resulting SQL (a single query):
Note, this Query returns an ActiveRecord::Relation, which is ideal because the result can be further built upon (e.g. order).
Sometimes it’s just too hard return a Relation though, or it’s not worth the effort yet because I’m prototyping. In those cases, I’ll write a Query class that returns data instead (i.e. triggers the query/queries and returns data in the form of models, hash, or something else). I use the naming convention
.data if it’s returning already-queried data, otherwise
.relation (as above).
The main benefit of the Query pattern is code organization; it’s an easy way to pull something potentially complicated out of the Model (or gasp, a Controller) and into its own file. Queries are easy to test in isolation too. They also follow the Single Responsibility Principle.
Rule #6: Avoid ad-hoc queries outside of Scopes and Query Objects
I can’t remember where I heard it first, but this rule of thumb has always stuck with me:
Restrict access to ActiveRecord generic query-building methods (e.g.
.not, etc) to scopes and Query objects.
That is, encapsulate data access into scopes and Query objects, rather than building ad-hoc queries in services, controllers, tasks, etc.
Why? An ad-hoc query embedded in a controller (or view, task, etc) is harder to test in isolation and cannot be reused. And it’s easier to reason about a codebase that follows some rules, making it easier to understand and maintain.
Rule #7: Use the Right Types
Every databases provides more datatypes than your ORM might have you believe. These are some less common Postgres types that I think are applicable to a vast majority of applications:
- Want to preserve case, but have all comparisons be case-insensitive?
citext(docs) is your type. Use it in your migrations just like you would a string.
- Need to capture a set of things (e.g. locations, tags, keywords) but a separate table and join table feels like overkill? Use the
arraytype (PG docs / Rails docs)
- Modeling a date, int, or float range? Use one of the range types (PG docs / Rails docs).
- Need a globally unique ID (primary key or otherwise)? Use the
UUIDtype (PG docs / Rails docs).
- Need to store a JSON blob, or thinking about a NoSQL DB? Use one of the
JSONtypes (PG docs / Rails docs).
These are just a couple of the many specialized types available. Check out Understanding the power of data types – PostgreSQL’s Secret Weapon to learn about others.
Rule #8: Consider your database’s Full-Text Search
Both Postgres and MySQL have full-text search capabilities. While not as rich as Elastic Search or Solr, for many problems it’s good enough and keeps the overall architecture simple because you already have a dependency on your database. We have a few blog posts on using Postgres’s full text search, check them out:
And there’s the PG docs.
Rule #9: Use stored procedures as a Last Resort
Wait what?! I’m saying use your database, but not to use stored procedures?!
Yup. There’s a time and place for them, but I think it’s better to avoid them while a product is rapidly evolving. They’re harder to work with and change, awkward to test, and almost definitely unnecessary early on. Keep it simple by leaving business logic out of your database, at least until something drives you to re-evaluate.
I believe a product will perform better and be easier to work with when the database is used to its potential.
Reducing the number of queries, using indexes, or any of the other suggestions aren’t premature optimizations IMHO. It’s using your database correctly. Of course, there is a point of diminishing returns: e.g. writing a tortuous raw SQL query to go from 3 easy queries down to 1. Use your best judgement.
Your users — and fellow developers — will thank you.