Adventures in Searching with Postgres: Part 1

Posted in Database, Rails, Ruby

For a recent project, we built a live search over more than 60,000 records that used both simple pattern matching and full-text search. The records we needed to search were diagnoses used to assign patients to a therapy. I had never done full-text search or anything real-time with that many records, so I ended up doing a lot of experimentation. These posts will cover my experience, and I hope they’ll be of value to anyone implementing their own Postgres search.


The Problem

The records we were searching were International Statistical Classification of Diseases (ICD) codes. Each record consists of a 3-7 character code, a short description averaging 11 words, and a few other data and association fields (for this post, I generated 100,000 records matching the real ICD format). We needed to be able to search by code and description, and users would be changing their search query quickly to find the right record, so it needed to be responsive. In part one, I’ll cover the code search where a user enters one or more codes (which may be partial).

Naïve Approach

Given the need for partial matches, I started off with the built-in Arel #matches method:
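A minimal sketch of that first attempt, assuming a standard Rails model (the `search_codes` name is mine; `sanitize_sql_like` is the Rails helper that escapes LIKE wildcards in user input):

```ruby
class ICDCode < ActiveRecord::Base
  # Arel's #matches generates a case-insensitive ILIKE comparison;
  # the trailing % matches any code beginning with the query.
  def self.search_codes(query)
    where(arel_table[:code].matches("#{sanitize_sql_like(query)}%"))
  end
end
```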

This generates the query:
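Something like this (the exact quoting Rails emits may differ slightly):

```sql
SELECT "icd_codes".* FROM "icd_codes"
WHERE ("icd_codes"."code" ILIKE 'A123%')
```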

I ran benchmarks on this using queries of different lengths and got an average of 39.5ms, which is not ideal.

Case Sensitivity

You can see from the query that we’re using the Postgres ILIKE operator, which is case-insensitive. All the codes were capitalized in the database, so I wondered if a plain LIKE query would be any faster. Since I couldn’t find a built-in Arel method for a case-sensitive match, I had to use a custom operator:
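A sketch using Arel’s `InfixOperation` to emit LIKE instead of ILIKE (again, `search_codes` is my name for the scope):

```ruby
class ICDCode < ActiveRecord::Base
  # Arel has no built-in case-sensitive match, so build the LIKE node by hand.
  def self.search_codes(query)
    pattern = Arel::Nodes.build_quoted("#{sanitize_sql_like(query)}%")
    where(Arel::Nodes::InfixOperation.new('LIKE', arel_table[:code], pattern))
  end
end
```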

As I suspected, this did speed up the queries, giving an average of 10.7ms. This was much better, but I hadn’t even added an index!

Adding an Index

I added a normal index to the code column, and then… no change. I was still getting average query times of about 11ms. So why was there no change? I ran an EXPLAIN query (ICDCode.where(···).explain) and saw this:
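A representative plan for this kind of query (costs and row counts elided, since they depend on the data):

```
 Seq Scan on icd_codes  (cost=… rows=… width=…)
   Filter: ((code)::text ~~ 'A123%'::text)
```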

We can see that it’s running a sequential scan of the table rather than using the index. The PostgreSQL manual suggests that the standard B-tree index should work with pattern matching operators as long as the pattern is anchored at the beginning of the string, which mine was. After some searching, I found that the Postgres indices are affected by the locale settings. By default, Postgres indices use the generic “C” locale, whereas my machine was using the more modern en_US.UTF-8. Since the locales didn’t match, I couldn’t use the index.

Fixing the Index: Postgres Operator Classes

Fortunately, you can specify an operator class for the index, which ignores the locale rules and just compares strings character by character.
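For a varchar column the operator class is `varchar_pattern_ops` (text columns use `text_pattern_ops`). In raw SQL, the index looks like:

```sql
CREATE INDEX index_icd_codes_on_code
  ON icd_codes (code varchar_pattern_ops);
```

In a Rails migration you can run this with `execute`; newer Rails versions (5.2+) also accept an `opclass:` option on `add_index`.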

After adding the operator class to the index, another EXPLAIN query showed that it was being used:
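Representative output (costs elided); the `~>=~` and `~<~` operators are the character-by-character comparisons the operator class provides:

```
 Index Scan using index_icd_codes_on_code on icd_codes  (cost=… rows=…)
   Index Cond: (((code)::text ~>=~ 'A123'::text) AND ((code)::text ~<~ 'A124'::text))
   Filter: ((code)::text ~~ 'A123%'::text)
```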

We can see that the condition is looking for strings >= A123 and < A124 rather than scanning all values, and our benchmark reflects the change: average query times were down to 1.3ms!

Side Note

When looking at EXPLAIN results, keep in mind the Postgres query planner may use different approaches depending on the specifics of the query. For example, using a shorter query string resulted in a different plan:
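For instance, a one-character query such as LIKE 'A%' matches far more rows, and the planner may switch to a bitmap scan, roughly:

```
 Bitmap Heap Scan on icd_codes  (cost=… rows=…)
   Filter: ((code)::text ~~ 'A%'::text)
   ->  Bitmap Index Scan on index_icd_codes_on_code  (cost=… rows=…)
         Index Cond: (((code)::text ~>=~ 'A'::text) AND ((code)::text ~<~ 'B'::text))
```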

You can read more about bitmap heap scans in the PostgreSQL documentation.

Handling More Complex Queries

The next issue was handling queries containing multiple codes. A common technique for combining an arbitrary number of Arel conditions is using #reduce with the #and and #or methods:
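A sketch of that fold (`code_matches` appears in the comments below, so I’ve kept that name; `any_code_matches` is mine):

```ruby
class ICDCode < ActiveRecord::Base
  # One case-sensitive LIKE condition for a single (possibly partial) code.
  def self.code_matches(term)
    pattern = Arel::Nodes.build_quoted("#{sanitize_sql_like(term)}%")
    Arel::Nodes::InfixOperation.new('LIKE', arel_table[:code], pattern)
  end

  # Fold the conditions together: each step ORs the next condition on.
  def self.any_code_matches(terms)
    terms.map { |t| code_matches(t) }.reduce { |combined, cond| combined.or(cond) }
  end
end

ICDCode.where(ICDCode.any_code_matches(%w[A123 B45]))
```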

This combines all the LIKE conditions and can be used in our #where call:

This generates the query:
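For two terms, something like:

```sql
SELECT "icd_codes".* FROM "icd_codes"
WHERE (("icd_codes"."code" LIKE 'A123%') OR ("icd_codes"."code" LIKE 'B45%'))
```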

I ran some benchmarks using different numbers of search terms:

Term Count | Avg. Query Time
1          | 1.4ms
2          | 2.6ms
3          | 3.7ms
4          | 4.8ms
5          | 5.9ms

Looking good!

One More Tweak

These query times were plenty fast, but since we were only using a few of the columns from the table, I wondered if only selecting those would improve performance. My query then changed to:
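That is (with `conditions` standing in for the combined LIKE predicate built earlier):

```ruby
# `conditions` is the OR-combined Arel predicate from the multi-term search.
ICDCode.select(:id, :code, :description).where(conditions)
```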

and resulted in a 19-25% boost:

Term Count | SELECT * | SELECT id, code, description | Improvement
1          | 1.4ms    | 1.1ms                        | 19%
2          | 2.6ms    | 2.0ms                        | 24%
3          | 3.7ms    | 2.8ms                        | 24%
4          | 4.8ms    | 3.6ms                        | 25%
5          | 5.9ms    | 4.4ms                        | 25%

Result

After these improvements, we had query responses under 10ms and a fluid user experience. In part two, I’ll cover full-text search of the ICD code descriptions and many more benchmarks!


Feedback

  Comments: 5

  1. Pat Shaughnessy


    Great write-up! Thank you. This is a really helpful example to learn from. It’s shocking how much faster the case-sensitive seqscan was compared to the case-insensitive one. I guess the sensitive search more quickly eliminates strings that don’t match. Anyway – another reason to be sure you have an index set up and being used.

    One question for you about the locale. You wrote:

    “By default, Postgres indices use the generic ‘C’ locale, whereas my machine was using the more modern en_US.UTF-8. Since the locales didn’t match, I couldn’t use the index.”

    Postgres must support UTF-8 indexes. Could you have recreated your index with UTF-8 instead?


  2. Nice article! What does the #or method do here and where does it come from? a.or(code_matches(e))


  3. Do you know if wildcards on either side of the query string significantly affect benchmarks?


  4. You can set the locale of the database to en_US.UTF-8 either as an argument of initdb or with CREATE DATABASE http://www.postgresql.org/docs/9.3/static/locale.html

    Tables and columns and indices can have their own collations http://www.postgresql.org/docs/9.3/static/indexes-collations.html

    Collation can be changed on a per query basis with the COLLATION clause http://www.postgresql.org/docs/9.3/static/collation.html
